UTF-8 (8-bit Unicode Transformation Format) is a character encoding for Unicode and ISO 10646 that uses variable-length symbols. UTF-8 was created by Robert C. Pike and Kenneth L. Thompson. It is defined as a standard by RFC 3629 of the Internet Engineering Task Force (IETF). It is currently one of the three encoding forms recognized by Unicode and web languages, or one of four in ISO 10646.
Its main features are:
- It can represent any Unicode character.
- It uses variable-length symbols (1 to 4 bytes per Unicode character).
- It includes the 7-bit US-ASCII specification, so any ASCII message is represented unchanged.
- Self-synchronization. The start of each symbol can be determined without restarting the read from the beginning of the stream.
- No overlap. The sets of values that each byte of a multibyte character can take are disjoint, so they cannot be confused with one another.
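The features above can be illustrated with a short Python sketch. The helper names `classify_byte` and `char_start` are hypothetical and the sample text is arbitrary; the byte ranges are those of well-formed UTF-8 as restricted by RFC 3629.

```python
def classify_byte(b: int) -> str:
    """Classify one byte by the role it can play in a UTF-8 sequence."""
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx: a complete 1-byte character
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx: never starts a character
    if 0xC2 <= b <= 0xDF:
        return "lead2"          # starts a 2-byte sequence
    if 0xE0 <= b <= 0xEF:
        return "lead3"          # starts a 3-byte sequence
    if 0xF0 <= b <= 0xF4:
        return "lead4"          # starts a 4-byte sequence
    return "invalid"            # 0xC0, 0xC1 and 0xF5..0xFF never appear


def char_start(data: bytes, i: int) -> int:
    """Back up from byte offset i to the first byte of its character."""
    while i > 0 and classify_byte(data[i]) == "continuation":
        i -= 1
    return i


text = "a\u00f1\u20ac\U0001f600"   # 'a', 'ñ', '€' and one emoji
data = text.encode("utf-8")

# Variable length: one to four bytes per character.
lengths = [len(ch.encode("utf-8")) for ch in text]

# ASCII text is byte-for-byte identical in both encodings.
ascii_unchanged = "abc".encode("utf-8") == "abc".encode("ascii")

# Self-synchronization: from the middle of '€' (offset 5) we can find
# where that character starts without reading from offset 0.
start_of_euro = char_start(data, 5)
```

Because the lead-byte and continuation-byte ranges do not intersect, `char_start` never needs to look back more than three bytes in valid input.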
These features make it attractive for encoding e-mail and web pages. The IETF requires that all Internet protocols identify the encoding they use for text and that UTF-8 be one of the supported encodings. The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to create and display messages encoded in UTF-8.
Advantages of UTF-8
- UTF-8 can encode any Unicode character.
- It is compatible with US-ASCII: the 7-bit repertoire is encoded directly.
- Easy identification. A sample of data can be clearly identified as UTF-8 with a simple algorithm. The probability of a correct identification increases with the size of the sample.
- UTF-8 saves storage space for text in Latin scripts, where characters included in US-ASCII are common, compared with other formats such as UTF-16.
- The byte sequence for one character is never part of a longer sequence for another character, because it carries synchronization information.
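As a rough illustration of the identification and storage points above, the following Python sketch uses strict decoding as the "simple algorithm". This is one possible check rather than a prescribed one; `looks_like_utf8` is a hypothetical helper name and the sample sentence is arbitrary.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")    # strict decoding rejects malformed input
        return True
    except UnicodeDecodeError:
        return False


sample = "El ni\u00f1o comi\u00f3 pi\u00f1a"   # mostly ASCII, three accented letters

utf8_bytes = sample.encode("utf-8")
latin1_bytes = sample.encode("latin-1")
utf16_bytes = sample.encode("utf-16-le")       # little-endian, no byte order mark

is_utf8 = looks_like_utf8(utf8_bytes)
latin1_passes = looks_like_utf8(latin1_bytes)  # 0xF1 followed by 'o' is malformed

sizes = (len(utf8_bytes), len(utf16_bytes))
```

Pure ASCII input also passes this check, since ASCII is a subset of UTF-8; the test only becomes informative once the sample contains bytes above 0x7F.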
Disadvantages of UTF-8
- UTF-8 uses variable-length symbols, which means that different characters may be encoded with different numbers of bytes. Finding the character at a given position requires scanning the string from the start.
- Most ideographic characters take 3 bytes in UTF-8 but only 2 in UTF-16. As a result, Chinese, Japanese, and Korean text takes more space in UTF-8.
- UTF-8 offers worse performance than UTF-16 and UTF-32 in terms of computational cost, for example in sorting operations.
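The first two drawbacks can be sketched in Python as follows. `byte_offset_of_char` is a hypothetical helper that assumes well-formed input, and the sample strings are arbitrary.

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Find the byte offset of the index-th character (0-based).

    There is no shortcut: every byte before the target must be inspected.
    """
    count = -1
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:      # not a continuation byte: a character starts here
            count += 1
            if count == index:
                return offset
    raise IndexError("character index out of range")


data = "a\u00f1o \u65e5\u672c".encode("utf-8")   # 'año ' followed by two kanji
offset_of_fifth = byte_offset_of_char(data, 4)   # first kanji
offset_of_sixth = byte_offset_of_char(data, 5)   # second kanji

cjk = "\u65e5\u672c\u8a9e"                       # three BMP ideographs
cjk_sizes = (len(cjk.encode("utf-8")), len(cjk.encode("utf-16-le")))
```

The linear scan is the cost of variable-length symbols; a fixed-width encoding such as UTF-32 could compute the offset directly as `4 * index`.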
UTF-8 code table
Unicode code point | character | UTF-8 (hex.) | name |
---|---|---|---|
U+0000 | | 00 | <control> |
U+0001 | | 01 | <control> |
U+0002 | | 02 | <control> |
U+0003 | | 03 | <control> |
U+0004 | | 04 | <control> |
U+0005 | | 05 | <control> |
U+0006 | | 06 | <control> |
U+0007 | | 07 | <control> |
U+0008 | | 08 | <control> |
U+0009 | | 09 | <control> |
U+000A | | 0a | <control> |
U+000B | | 0b | <control> |
U+000C | | 0c | <control> |
U+000D | | 0d | <control> |
U+000E | | 0e | <control> |
U+000F | | 0f | <control> |
U+0010 | | 10 | <control> |
U+0011 | | 11 | <control> |
U+0012 | | 12 | <control> |
U+0013 | | 13 | <control> |
U+0014 | | 14 | <control> |
U+0015 | | 15 | <control> |
U+0016 | | 16 | <control> |
U+0017 | | 17 | <control> |
U+0018 | | 18 | <control> |
U+0019 | | 19 | <control> |
U+001A | | 1a | <control> |
U+001B | | 1b | <control> |
U+001C | | 1c | <control> |
U+001D | | 1d | <control> |
U+001E | | 1e | <control> |
U+001F | | 1f | <control> |
U+0020 | | 20 | SPACE |
U+0021 | ! | 21 | EXCLAMATION MARK |
U+0022 | " | 22 | QUOTATION MARK |
U+0023 | # | 23 | NUMBER SIGN |
U+0024 | $ | 24 | DOLLAR SIGN |
U+0025 | % | 25 | PERCENT SIGN |
U+0026 | & | 26 | AMPERSAND |
U+0027 | ' | 27 | APOSTROPHE |
U+0028 | ( | 28 | LEFT PARENTHESIS |
U+0029 | ) | 29 | RIGHT PARENTHESIS |
U+002A | * | 2a | ASTERISK |
U+002B | + | 2b | PLUS SIGN |
U+002C | , | 2c | COMMA |
U+002D | - | 2d | HYPHEN-MINUS |
U+002E | . | 2e | FULL STOP |
U+002F | / | 2f | SOLIDUS |
U+0030 | 0 | 30 | DIGIT ZERO |
U+0031 | 1 | 31 | DIGIT ONE |
U+0032 | 2 | 32 | DIGIT TWO |
U+0033 | 3 | 33 | DIGIT THREE |
U+0034 | 4 | 34 | DIGIT FOUR |
U+0035 | 5 | 35 | DIGIT FIVE |
U+0036 | 6 | 36 | DIGIT SIX |
U+0037 | 7 | 37 | DIGIT SEVEN |
U+0038 | 8 | 38 | DIGIT EIGHT |
U+0039 | 9 | 39 | DIGIT NINE |
U+003A | : | 3a | COLON |
U+003B | ; | 3b | SEMICOLON |
U+003C | < | 3c | LESS-THAN SIGN |
U+003D | = | 3d | EQUALS SIGN |
U+003E | > | 3e | GREATER-THAN SIGN |
U+003F | ? | 3f | QUESTION MARK |
U+0040 | @ | 40 | COMMERCIAL AT |
U+0041 | A | 41 | LATIN CAPITAL LETTER A |
U+0042 | B | 42 | LATIN CAPITAL LETTER B |
U+0043 | C | 43 | LATIN CAPITAL LETTER C |
U+0044 | D | 44 | LATIN CAPITAL LETTER D |
U+0045 | E | 45 | LATIN CAPITAL LETTER E |
U+0046 | F | 46 | LATIN CAPITAL LETTER F |
U+0047 | G | 47 | LATIN CAPITAL LETTER G |
U+0048 | H | 48 | LATIN CAPITAL LETTER H |
U+0049 | I | 49 | LATIN CAPITAL LETTER I |
U+004A | J | 4a | LATIN CAPITAL LETTER J |
U+004B | K | 4b | LATIN CAPITAL LETTER K |
U+004C | L | 4c | LATIN CAPITAL LETTER L |
U+004D | M | 4d | LATIN CAPITAL LETTER M |
U+004E | N | 4e | LATIN CAPITAL LETTER N |
U+004F | O | 4f | LATIN CAPITAL LETTER O |
U+0050 | P | 50 | LATIN CAPITAL LETTER P |
U+0051 | Q | 51 | LATIN CAPITAL LETTER Q |
U+0052 | R | 52 | LATIN CAPITAL LETTER R |
U+0053 | S | 53 | LATIN CAPITAL LETTER S |
U+0054 | T | 54 | LATIN CAPITAL LETTER T |
U+0055 | U | 55 | LATIN CAPITAL LETTER U |
U+0056 | V | 56 | LATIN CAPITAL LETTER V |
U+0057 | W | 57 | LATIN CAPITAL LETTER W |
U+0058 | X | 58 | LATIN CAPITAL LETTER X |
U+0059 | Y | 59 | LATIN CAPITAL LETTER Y |
U+005A | Z | 5a | LATIN CAPITAL LETTER Z |
U+005B | [ | 5b | LEFT SQUARE BRACKET |
U+005C | \ | 5c | REVERSE SOLIDUS |
U+005D | ] | 5d | RIGHT SQUARE BRACKET |
U+005E | ^ | 5e | CIRCUMFLEX ACCENT |
U+005F | _ | 5f | LOW LINE |
U+0060 | ` | 60 | GRAVE ACCENT |
U+0061 | a | 61 | LATIN SMALL LETTER A |
U+0062 | b | 62 | LATIN SMALL LETTER B |
U+0063 | c | 63 | LATIN SMALL LETTER C |
U+0064 | d | 64 | LATIN SMALL LETTER D |
U+0065 | e | 65 | LATIN SMALL LETTER E |
U+0066 | f | 66 | LATIN SMALL LETTER F |
U+0067 | g | 67 | LATIN SMALL LETTER G |
U+0068 | h | 68 | LATIN SMALL LETTER H |
U+0069 | i | 69 | LATIN SMALL LETTER I |
U+006A | j | 6a | LATIN SMALL LETTER J |
U+006B | k | 6b | LATIN SMALL LETTER K |
U+006C | l | 6c | LATIN SMALL LETTER L |
U+006D | m | 6d | LATIN SMALL LETTER M |
U+006E | n | 6e | LATIN SMALL LETTER N |
U+006F | o | 6f | LATIN SMALL LETTER O |
U+0070 | p | 70 | LATIN SMALL LETTER P |
U+0071 | q | 71 | LATIN SMALL LETTER Q |
U+0072 | r | 72 | LATIN SMALL LETTER R |
U+0073 | s | 73 | LATIN SMALL LETTER S |
U+0074 | t | 74 | LATIN SMALL LETTER T |
U+0075 | u | 75 | LATIN SMALL LETTER U |
U+0076 | v | 76 | LATIN SMALL LETTER V |
U+0077 | w | 77 | LATIN SMALL LETTER W |
U+0078 | x | 78 | LATIN SMALL LETTER X |
U+0079 | y | 79 | LATIN SMALL LETTER Y |
U+007A | z | 7a | LATIN SMALL LETTER Z |
U+007B | { | 7b | LEFT CURLY BRACKET |
U+007C | \| | 7c | VERTICAL LINE |
U+007D | } | 7d | RIGHT CURLY BRACKET |
U+007E | ~ | 7e | TILDE |
U+007F | | 7f | <control> |
U+0080 | | c2 80 | <control> |
U+0081 | | c2 81 | <control> |
U+0082 | | c2 82 | <control> |
U+0083 | | c2 83 | <control> |
U+0084 | | c2 84 | <control> |
U+0085 | | c2 85 | <control> |
U+0086 | | c2 86 | <control> |
U+0087 | | c2 87 | <control> |
U+0088 | | c2 88 | <control> |
U+0089 | | c2 89 | <control> |
U+008A | | c2 8a | <control> |
U+008B | | c2 8b | <control> |
U+008C | | c2 8c | <control> |
U+008D | | c2 8d | <control> |
U+008E | | c2 8e | <control> |
U+008F | | c2 8f | <control> |
U+0090 | | c2 90 | <control> |
U+0091 | | c2 91 | <control> |
U+0092 | | c2 92 | <control> |
U+0093 | | c2 93 | <control> |
U+0094 | | c2 94 | <control> |
U+0095 | | c2 95 | <control> |
U+0096 | | c2 96 | <control> |
U+0097 | | c2 97 | <control> |
U+0098 | | c2 98 | <control> |
U+0099 | | c2 99 | <control> |
U+009A | | c2 9a | <control> |
U+009B | | c2 9b | <control> |
U+009C | | c2 9c | <control> |
U+009D | | c2 9d | <control> |
U+009E | | c2 9e | <control> |
U+009F | | c2 9f | <control> |
U+00A0 | | c2 a0 | NO-BREAK SPACE |
U+00A1 | ¡ | c2 a1 | INVERTED EXCLAMATION MARK |
U+00A2 | ¢ | c2 a2 | CENT SIGN |
U+00A3 | £ | c2 a3 | POUND SIGN |
U+00A4 | ¤ | c2 a4 | CURRENCY SIGN |
U+00A5 | ¥ | c2 a5 | YEN SIGN |
U+00A6 | ¦ | c2 a6 | BROKEN BAR |
U+00A7 | § | c2 a7 | SECTION SIGN |
U+00A8 | ¨ | c2 a8 | DIAERESIS |
U+00A9 | © | c2 a9 | COPYRIGHT SIGN |
U+00AA | ª | c2 aa | FEMININE ORDINAL INDICATOR |
U+00AB | « | c2 ab | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK |
U+00AC | ¬ | c2 ac | NOT SIGN |
U+00AD | | c2 ad | SOFT HYPHEN |
U+00AE | ® | c2 ae | REGISTERED SIGN |
U+00AF | ¯ | c2 af | MACRON |
U+00B0 | ° | c2 b0 | DEGREE SIGN |
U+00B1 | ± | c2 b1 | PLUS-MINUS SIGN |
U+00B2 | ² | c2 b2 | SUPERSCRIPT TWO |
U+00B3 | ³ | c2 b3 | SUPERSCRIPT THREE |
U+00B4 | ´ | c2 b4 | ACUTE ACCENT |
U+00B5 | µ | c2 b5 | MICRO SIGN |
U+00B6 | ¶ | c2 b6 | PILCROW SIGN |
U+00B7 | · | c2 b7 | MIDDLE DOT |
U+00B8 | ¸ | c2 b8 | CEDILLA |
U+00B9 | ¹ | c2 b9 | SUPERSCRIPT ONE |
U+00BA | º | c2 ba | MASCULINE ORDINAL INDICATOR |
U+00BB | » | c2 bb | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK |
U+00BC | ¼ | c2 bc | VULGAR FRACTION ONE QUARTER |
U+00BD | ½ | c2 bd | VULGAR FRACTION ONE HALF |
U+00BE | ¾ | c2 be | VULGAR FRACTION THREE QUARTERS |
U+00BF | ¿ | c2 bf | INVERTED QUESTION MARK |
U+00C0 | À | c3 80 | LATIN CAPITAL LETTER A WITH GRAVE |
U+00C1 | Á | c3 81 | LATIN CAPITAL LETTER A WITH ACUTE |
U+00C2 | Â | c3 82 | LATIN CAPITAL LETTER A WITH CIRCUMFLEX |
U+00C3 | Ã | c3 83 | LATIN CAPITAL LETTER A WITH TILDE |
U+00C4 | Ä | c3 84 | LATIN CAPITAL LETTER A WITH DIAERESIS |
U+00C5 | Å | c3 85 | LATIN CAPITAL LETTER A WITH RING ABOVE |
U+00C6 | Æ | c3 86 | LATIN CAPITAL LETTER AE |
U+00C7 | Ç | c3 87 | LATIN CAPITAL LETTER C WITH CEDILLA |
U+00C8 | È | c3 88 | LATIN CAPITAL LETTER E WITH GRAVE |
U+00C9 | É | c3 89 | LATIN CAPITAL LETTER E WITH ACUTE |
U+00CA | Ê | c3 8a | LATIN CAPITAL LETTER E WITH CIRCUMFLEX |
U+00CB | Ë | c3 8b | LATIN CAPITAL LETTER E WITH DIAERESIS |
U+00CC | Ì | c3 8c | LATIN CAPITAL LETTER I WITH GRAVE |
U+00CD | Í | c3 8d | LATIN CAPITAL LETTER I WITH ACUTE |
U+00CE | Î | c3 8e | LATIN CAPITAL LETTER I WITH CIRCUMFLEX |
U+00CF | Ï | c3 8f | LATIN CAPITAL LETTER I WITH DIAERESIS |
U+00D0 | Ð | c3 90 | LATIN CAPITAL LETTER ETH |
U+00D1 | Ñ | c3 91 | LATIN CAPITAL LETTER N WITH TILDE |
U+00D2 | Ò | c3 92 | LATIN CAPITAL LETTER O WITH GRAVE |
U+00D3 | Ó | c3 93 | LATIN CAPITAL LETTER O WITH ACUTE |
U+00D4 | Ô | c3 94 | LATIN CAPITAL LETTER O WITH CIRCUMFLEX |
U+00D5 | Õ | c3 95 | LATIN CAPITAL LETTER O WITH TILDE |
U+00D6 | Ö | c3 96 | LATIN CAPITAL LETTER O WITH DIAERESIS |
U+00D7 | × | c3 97 | MULTIPLICATION SIGN |
U+00D8 | Ø | c3 98 | LATIN CAPITAL LETTER O WITH STROKE |
U+00D9 | Ù | c3 99 | LATIN CAPITAL LETTER U WITH GRAVE |
U+00DA | Ú | c3 9a | LATIN CAPITAL LETTER U WITH ACUTE |
U+00DB | Û | c3 9b | LATIN CAPITAL LETTER U WITH CIRCUMFLEX |
U+00DC | Ü | c3 9c | LATIN CAPITAL LETTER U WITH DIAERESIS |
U+00DD | Ý | c3 9d | LATIN CAPITAL LETTER Y WITH ACUTE |
U+00DE | Þ | c3 9e | LATIN CAPITAL LETTER THORN |
U+00DF | ß | c3 9f | LATIN SMALL LETTER SHARP S |
U+00E0 | à | c3 a0 | LATIN SMALL LETTER A WITH GRAVE |
U+00E1 | á | c3 a1 | LATIN SMALL LETTER A WITH ACUTE |
U+00E2 | â | c3 a2 | LATIN SMALL LETTER A WITH CIRCUMFLEX |
U+00E3 | ã | c3 a3 | LATIN SMALL LETTER A WITH TILDE |
U+00E4 | ä | c3 a4 | LATIN SMALL LETTER A WITH DIAERESIS |
U+00E5 | å | c3 a5 | LATIN SMALL LETTER A WITH RING ABOVE |
U+00E6 | æ | c3 a6 | LATIN SMALL LETTER AE |
U+00E7 | ç | c3 a7 | LATIN SMALL LETTER C WITH CEDILLA |
U+00E8 | è | c3 a8 | LATIN SMALL LETTER E WITH GRAVE |
U+00E9 | é | c3 a9 | LATIN SMALL LETTER E WITH ACUTE |
U+00EA | ê | c3 aa | LATIN SMALL LETTER E WITH CIRCUMFLEX |
U+00EB | ë | c3 ab | LATIN SMALL LETTER E WITH DIAERESIS |
U+00EC | ì | c3 ac | LATIN SMALL LETTER I WITH GRAVE |
U+00ED | í | c3 ad | LATIN SMALL LETTER I WITH ACUTE |
U+00EE | î | c3 ae | LATIN SMALL LETTER I WITH CIRCUMFLEX |
U+00EF | ï | c3 af | LATIN SMALL LETTER I WITH DIAERESIS |
U+00F0 | ð | c3 b0 | LATIN SMALL LETTER ETH |
U+00F1 | ñ | c3 b1 | LATIN SMALL LETTER N WITH TILDE |
U+00F2 | ò | c3 b2 | LATIN SMALL LETTER O WITH GRAVE |
U+00F3 | ó | c3 b3 | LATIN SMALL LETTER O WITH ACUTE |
U+00F4 | ô | c3 b4 | LATIN SMALL LETTER O WITH CIRCUMFLEX |
U+00F5 | õ | c3 b5 | LATIN SMALL LETTER O WITH TILDE |
U+00F6 | ö | c3 b6 | LATIN SMALL LETTER O WITH DIAERESIS |
U+00F7 | ÷ | c3 b7 | DIVISION SIGN |
U+00F8 | ø | c3 b8 | LATIN SMALL LETTER O WITH STROKE |
U+00F9 | ù | c3 b9 | LATIN SMALL LETTER U WITH GRAVE |
U+00FA | ú | c3 ba | LATIN SMALL LETTER U WITH ACUTE |
U+00FB | û | c3 bb | LATIN SMALL LETTER U WITH CIRCUMFLEX |
U+00FC | ü | c3 bc | LATIN SMALL LETTER U WITH DIAERESIS |
U+00FD | ý | c3 bd | LATIN SMALL LETTER Y WITH ACUTE |
U+00FE | þ | c3 be | LATIN SMALL LETTER THORN |
U+00FF | ÿ | c3 bf | LATIN SMALL LETTER Y WITH DIAERESIS |
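A row in the layout of the table above (code point, character, UTF-8 bytes in hex, name) can be derived for any code point. The sketch below is one way to do it in Python; `table_row` is a hypothetical helper, and the choice to leave the character column empty for control, space, and other non-printable characters is an assumption made to mirror the table.

```python
import unicodedata


def table_row(cp: int) -> str:
    """Format one code point as 'U+XXXX | char | utf-8 hex | name |'."""
    ch = chr(cp)
    hex_bytes = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    # Control characters have no formal name; fall back to a placeholder.
    name = unicodedata.name(ch, "<control>")
    # Leave the character column empty when there is nothing visible to show.
    shown = ch if ch.isprintable() and not ch.isspace() else ""
    return f"U+{cp:04X} | {shown} | {hex_bytes} | {name} |"


row_capital_a = table_row(0x41)
row_n_tilde = table_row(0xF1)

# Print the whole range covered by the table, U+0000 to U+00FF.
for cp in range(0x100):
    print(table_row(cp))
```

Code points up to U+007F produce a single byte equal to the code point itself, while U+0080 to U+00FF produce two bytes starting with `c2` or `c3`, which matches the pattern visible in the table.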
UTF-8 codes for accented characters
c3 a1 -> á
c3 a9 -> é
c3 ad -> í
c3 b3 -> ó
c3 ba -> ú
c3 b1 -> ñ
c3 a0 -> à
c3 a8 -> è
c3 b2 -> ò
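Each accented character listed above takes two bytes in UTF-8. The Python sketch below computes those bytes and, as an aside, shows what appears when the same bytes are wrongly decoded as Latin-1, which is offered here as an illustration of why the encoding must be declared. The variable names are illustrative.

```python
accents = "áéíóúñàèò"

# Map each character to its UTF-8 bytes written as lowercase hex.
codes = {
    ch: " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
    for ch in accents
}

for ch, hex_bytes in codes.items():
    print(hex_bytes, "->", ch)

# Decoding UTF-8 bytes with the wrong encoding yields two characters
# per accented letter instead of one.
misread = "á".encode("utf-8").decode("latin-1")

# Decoding with the right encoding restores the original character.
restored = bytes.fromhex("c3 a1").decode("utf-8")
```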