Unicode to String

venomnert

Notes:

  • I would like this blog post and my other Elixir-related posts to be treated as living organisms. They will continue to grow as I gain new insight into their topics.
  • When introducing a new idea or definition, I will try to explain it from as many angles as possible. I will be using ADEPT as a guide.

What is unicode?

Technical Definition:

"The Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world."

Source

Plain English & Example:

It's an internationally agreed-upon way of associating integers with characters. E.g., in the USA the code point 65 represents the character 'A', and in Japan the same code point 65 also represents the character 'A'.

What does standard mean within the context of programming?

Technical Definition:

"Software standards consist of certain terms, concepts, data formats, document styles and techniques agreed upon by software creators so that their software can understand the files and data created by a different computer program. To be considered a standard, a certain protocol needs to be accepted and incorporated by a group of developers who contribute to the definition and maintenance of the standard."

Source

Does unicode mean translation?

No, it's simply an internationally agreed-upon way to map an integer (aka a code point) to a character. Under the Unicode standard the code point 65 maps to the character 'A'; therefore the USA, Japan, and every other country adhere to that same mapping.
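
As a small illustration, the mapping can be checked from Elixir in both directions. This is a minimal sketch; the variable names are only for this example.

```elixir
# ?A is Elixir's syntax for "the code point of the character A".
code_point = ?A
IO.inspect(code_point)   # 65

# Going the other way: build a one-character string from the code point.
char = <<code_point::utf8>>
IO.inspect(char)         # "A"
```

The result is the same on every machine, which is the whole point of a shared standard.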

What is code point?

Technical Definition:

The notion of a code point is used for abstraction, to distinguish both:

  • the number from an encoding as a sequence of bits, and
  • the abstract character from a particular graphical representation (glyph).

Source

Plain English & Example:

  • It is an integer value that maps to a character within the Unicode standard (or another character encoding standard).
  • E.g., 2947 is the code point for the Tamil character 'ஃ' in the Unicode standard. Note that Unicode code points are usually written in hexadecimal: 'ஃ' is commonly referred to as U+0B83, and in Elixir it can be written as "\u{0B83}".
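
The Tamil example can be verified with a short sketch like the one below (variable names are illustrative assumptions, not from any library):

```elixir
# Write the character using its hexadecimal code point escape.
tamil = "\u{0B83}"

# A charlist is a list of code points, so this pulls out the integer.
[code_point] = String.to_charlist(tamil)
IO.inspect(code_point)             # 2947

# The hex literal 0x0B83 and the decimal 2947 are the same integer.
IO.inspect(code_point == 0x0B83)   # true

# Building the string back from the decimal code point gives the same character.
IO.inspect(<<2947::utf8>> == tamil)   # true
```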

What is hexadecimal digit?

Technical Definition:

"Hexadecimal (also base 16, or hex) is a positional system that represents numbers using a base of 16"

Source
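
To make "positional system" concrete, here is a small sketch that splits 2947 (the Tamil code point used earlier) into its base-16 digits and rebuilds it from their place values. The variable names are just for this example.

```elixir
# Split 2947 into its base-16 digits. B is 11, so 0xB83 becomes [11, 8, 3].
digits = Integer.digits(2947, 16)
IO.inspect(digits, charlists: :as_lists)   # [11, 8, 3]

# Each position is worth 16 times the one to its right.
rebuilt = 11 * 16 * 16 + 8 * 16 + 3
IO.inspect(rebuilt)                        # 2947

# Elixir can also convert to and from hex text directly.
hex_text = Integer.to_string(2947, 16)
IO.inspect(hex_text)                       # "B83"
IO.inspect(String.to_integer("B83", 16))   # 2947
```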

What is encoding & UTF-8?

Technical Definition:

"It's the implementation used to convert code point integer to bytes."

Source

Elixir uses UTF-8 to encode its strings, which means that each code point is encoded as a series of one to four 8-bit bytes.
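
A quick way to see this is to look at how many bytes different characters take. The sample characters below are my own picks for illustration (written as escapes so the example does not depend on the file's encoding).

```elixir
# "A", "é", "ஃ" and an emoji, written as code point escapes.
samples = ["A", "\u00E9", "\u0B83", "\u{1F600}"]

# byte_size/1 counts the bytes UTF-8 used for each one.
sizes = Enum.map(samples, &byte_size/1)
IO.inspect(sizes)         # [1, 2, 3, 4]

# The actual bytes behind the Tamil character.
tamil_bytes = :binary.bin_to_list("\u0B83")
IO.inspect(tamil_bytes)   # [224, 174, 131]
```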

What is a bit?

Technical Definition:

A single unit of data that can be either 1 or 0

Source

What is 8-bit? What is byte?

Technical Definition:

A byte is 8 bits

Source

What is a bitstring?

  • A bitstring is a fundamental data type in Elixir, denoted with the <<>> syntax.

Q: What is a binary?

  • A: A binary is a bitstring whose number of bits is divisible by 8. In other words, a binary is a sequence of whole bytes.

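Q: How does the <<>> constructor work?

  • It builds a bitstring, which is a contiguous sequence of bits in memory. Each comma-separated segment is stored in 8 bits by default, and the ::size modifier sets a different number of bits for that segment.

Q: What happens when an integer needs more bits than its segment provides?

  • Any value that exceeds what can be stored in the number of bits provisioned is truncated: the left-most (most significant) bits are dropped and only the right-most bits that fit are kept.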
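
Here is a minimal sketch of the constructor in action. The variable names are only for illustration.

```elixir
# Each segment takes 8 bits unless a size is given after ::
two_bytes = <<1, 2>>
IO.inspect(bit_size(two_bytes))    # 16

# Two 4-bit segments sit side by side in memory and form one byte: 0011 0101
packed = <<3::4, 5::4>>
IO.inspect(packed == <<0x35>>)     # true

# Three bits on their own: a bitstring that is smaller than a byte.
three_bits = <<1::3>>
IO.inspect(bit_size(three_bits))   # 3
```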

Q: What will the result be for the following <<10::3>>?

  • A: <<0::1, 1::1, 0::1>>
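
The answer can be checked directly. This is a small sketch of the truncation behaviour, with a second example of my own using the default 8-bit size.

```elixir
# 10 is 1010 in base 2; only the lowest 3 bits (010) fit in a 3-bit segment.
truncated = <<10::3>>
IO.inspect(truncated == <<0::1, 1::1, 0::1>>)   # true
IO.inspect(truncated == <<2::3>>)               # true, because 010 is 2

# The same thing happens with the default 8 bits:
# 257 is 1_0000_0001 in base 2, so the ninth bit is dropped and 1 remains.
IO.inspect(<<257>> == <<1>>)                    # true
```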

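Q: Is every bitstring a binary, and is every binary a bitstring?

  • Not every bitstring is a binary, because a bitstring can have any number of bits (each segment is 1 byte by default, but other sizes are allowed). Every binary is a bitstring, because a binary's bit count is divisible by 8, which is a valid bitstring size.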
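
Elixir's built-in guards make the relationship easy to see. A minimal sketch, with illustrative variable names:

```elixir
three_bits = <<1::3>>   # 3 bits: not a whole number of bytes
one_byte = <<1>>        # 8 bits: exactly one byte

# {is it a bitstring?, is it a binary?}
IO.inspect({is_bitstring(three_bits), is_binary(three_bits)})   # {true, false}
IO.inspect({is_bitstring(one_byte), is_binary(one_byte)})       # {true, true}
```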

Q: Fundamentally I understand the difference between binary and bitstrings. But why make that distinction?

Q: What is a string?

Technical Definition:

"A string is a UTF-8 encoded binary"

Source

Plain English:

  • A string is a sequence of code points that is stored using the UTF-8 encoding.
  • Each unit of storage is 8 bits (one byte), which is why the result is a binary.
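
A short sketch showing that a string really is just a binary underneath, and that characters and bytes are counted separately (variable names are illustrative):

```elixir
cat = "CAT"
IO.inspect(is_binary(cat))           # true

# The same string written byte by byte.
IO.inspect(cat == <<67, 65, 84>>)    # true

# One Tamil character, but three bytes of storage.
tamil = "\u0B83"
IO.inspect(String.length(tamil))     # 1
IO.inspect(byte_size(tamil))         # 3
```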

Q: What is the rule used to encode a string to UTF-8?

  • See below section for answer.

Q: How to encode 'A' to UTF-8?

  1. Determine the character's code point, 'A' = 65
  2. Convert code point from decimal to binary: 65 = 1000001
  3. Then determine whether it will require 1, 2, 3 or 4 bytes to represent the binary. 65 fits in 7 bits, so it needs 1 byte.
  4. Then determine the encoding format for that byte count. A 1-byte sequence has the form 0xxxxxxx: a leading 0 followed by the 7 bits of the code point. So our final binary representation with the encoding format is 01000001. We added a leading 0.
  5. Then convert the new binary to hex 01000001 = 41
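
The steps above can be sketched in code. Utf8Sketch below is a hypothetical module written only for this post, not part of Elixir. It follows the standard UTF-8 byte layouts for 1 to 4 bytes and, to stay short, does not reject surrogate code points. In real code, <<code_point::utf8>> already does all of this.

```elixir
defmodule Utf8Sketch do
  # Hypothetical helper for illustration only.
  import Bitwise

  # 1 byte: 0xxxxxxx
  def encode(cp) when cp in 0..0x7F,
    do: <<0::1, cp::7>>

  # 2 bytes: 110xxxxx 10xxxxxx
  def encode(cp) when cp in 0x80..0x7FF,
    do: <<0b110::3, (cp >>> 6)::5, 0b10::2, cp::6>>

  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  def encode(cp) when cp in 0x800..0xFFFF,
    do: <<0b1110::4, (cp >>> 12)::4, 0b10::2, (cp >>> 6)::6, 0b10::2, cp::6>>

  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  def encode(cp) when cp in 0x10000..0x10FFFF,
    do:
      <<0b11110::5, (cp >>> 18)::3, 0b10::2, (cp >>> 12)::6, 0b10::2,
        (cp >>> 6)::6, 0b10::2, cp::6>>
end

# 'A' needs one byte, and that byte is hex 41.
IO.inspect(Utf8Sketch.encode(?A) == <<0x41>>)             # true

# The sketch agrees with Elixir's built-in encoder for a 3-byte character.
IO.inspect(Utf8Sketch.encode(0x0B83) == <<0x0B83::utf8>>) # true
```

Each ::6 segment keeps only the lowest 6 bits of its value, so the bitstring truncation rule from earlier does the bit slicing for us.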

Q: What defines a valid string? Why aren't all binaries a valid string?

  • A valid string is a binary that follows the UTF-8 encoding rules. Any binary that doesn't adhere to the UTF-8 format is not a valid string.
  • Because of these encoding rules, not every binary is a valid string.
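
String.valid?/1 makes this easy to try. In this sketch the two invalid binaries are my own examples: a byte that can never appear in UTF-8, and a 3-byte sequence with its last byte missing.

```elixir
# Every byte of "CAT" follows the 1-byte format.
plain = String.valid?("CAT")

# 0xFF never appears anywhere in valid UTF-8.
lone_ff = String.valid?(<<0xFF>>)

# 0xE0 announces a 3-byte sequence, but only 2 bytes are present.
cut_short = String.valid?(<<0xE0, 0xAE>>)

IO.inspect({plain, lone_ff, cut_short})   # {true, false, false}

# All three are still binaries; only one of them is a string.
IO.inspect(is_binary(<<0xFF>>))           # true
```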

Problem Set:

Q: What is the hex for the following character? The code point is 38486

Q: In Elixir, what will each of the following expressions print?

  1. iex(1)> 'A'
    1. Prints the charlist 'A' (newer Elixir versions display it as ~c"A").
  2. iex(1)> ?A
    1. Prints 65, the code point for the character 'A'.
  3. iex(1)> 65
    1. Prints the integer 65.
  4. iex(1)> [65]
    1. Prints 'A', since a list of integers that are all printable code points is displayed as a charlist.
  5. iex(1)> [[65]]
    1. Prints a list containing a charlist. In this case, ['A'].
  6. iex(1)> 'CAT' == [?C, ?A, ?T]
    1. true will be printed. The left-hand side is a charlist, which is the same as the list of code points on the right-hand side, based on our finding from question #4.
  7. iex(1)> 'CAT' == [67, 65, 84]
    1. true will be printed for the same reason as above.
  8. iex(1)> 'CAT' == [[67],[65],[84]]
    1. false will be printed. The right-hand side is a nested list, similar to question #5.
  9. iex(1)> "CAT" == [67, 65, 84]
    1. false will be printed. The left is a string (a binary) and the right is a list of code points.
  10. iex(1)> String.to_charlist("CAT") == [67, 65, 84]
    1. true will be printed. String.to_charlist returns the code points as a list of integers, source
  11. iex(1)> String.codepoints("CAT") == [67, 65, 84]
    1. false will be printed. String.codepoints returns each code point as a string, i.e. ["C", "A", "T"], source
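
The comparison questions (#6 to #11) can be run as one small script. This sketch uses the ~c"CAT" sigil, which means the same thing as 'CAT' and avoids the warning that newer Elixir versions give for single-quoted charlists.

```elixir
results = [
  ~c"CAT" == [?C, ?A, ?T],                     # question 6
  ~c"CAT" == [67, 65, 84],                     # question 7
  ~c"CAT" == [[67], [65], [84]],               # question 8
  "CAT" == [67, 65, 84],                       # question 9
  String.to_charlist("CAT") == [67, 65, 84],   # question 10
  String.codepoints("CAT") == [67, 65, 84]     # question 11
]

IO.inspect(results)   # [true, true, false, false, true, false]

# What String.codepoints/1 actually returns: one string per code point.
IO.inspect(String.codepoints("CAT"))   # ["C", "A", "T"]
```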

Side Questions:

Q: What is the difference between Mb and MB?

"Mb stands for Megabits which is equal to 1,000,000 bits"

"MB stands for Megabytes which is equal to 8,000,000 bits"

Source
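
The arithmetic behind those two numbers, as a tiny sketch (variable names are just for this example):

```elixir
bits_per_byte = 8

# 1 Mb (megabit) is 1,000,000 bits.
megabit_in_bits = 1_000_000

# 1 MB (megabyte) is 1,000,000 bytes, and each byte is 8 bits.
megabyte_in_bits = 1_000_000 * bits_per_byte
IO.inspect(megabyte_in_bits)   # 8000000

# So one megabit holds only 125,000 bytes.
megabit_in_bytes = div(megabit_in_bits, bits_per_byte)
IO.inspect(megabit_in_bytes)   # 125000
```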
