- Understand different levels of data representation.
- Introduce some common data formats: CSV, JSON, and XML.
How is data represented on a computer?
You may be aware that at the lowest level, computers represent data in binary.
We can consider a single byte (eight bits): 0110 0001
Each bit corresponds to a power of two.
| Power of two | 128 | 64 | 32 | 16 | 8 | 4 | 2 | 1 |
|---|---|---|---|---|---|---|---|---|
| Bit | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
So 0110 0001 represents 64+32+1, or 97.
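We can check this arithmetic with Python's built-in `int` (which can parse a string of bits as a base-2 number) and `bin`:

```python
# int() with base 2 parses a string of bits; bin() goes the other way
print(int("01100001", 2))  # 97
print(64 + 32 + 1)         # 97
print(bin(97))             # 0b1100001
```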
This works for representing integers, but what about strings?
Those same bits 0110 0001 can also mean the character a.
We can use Python’s built-in `chr` to convert an integer to its text representation:
```python
for x in range(32, 53):
    if x < 47:
        print(f"{x:<3}{chr(x):<3}{x+20:<3}{chr(x+20):<3}{x+40:<3}{chr(x+40):<3}{x+60:<3}{chr(x+60):<3}{x+80:<3}{chr(x+80):<3}")
    else:
        print(f"{x:<3}{chr(x):<3}{x+20:<3}{chr(x+20):<3}{x+40:<3}{chr(x+40):<3}{x+60:<3}{chr(x+60):<3}")
```
32 52 4 72 H 92 \ 112 p
33 ! 53 5 73 I 93 ] 113 q
34 " 54 6 74 J 94 ^ 114 r
35 # 55 7 75 K 95 _ 115 s
36 $ 56 8 76 L 96 ` 116 t
37 % 57 9 77 M 97 a 117 u
38 & 58 : 78 N 98 b 118 v
39 ' 59 ; 79 O 99 c 119 w
40 ( 60 < 80 P 100 d 120 x
41 ) 61 = 81 Q 101 e 121 y
42 * 62 > 82 R 102 f 122 z
43 + 63 ? 83 S 103 g 123 {
44 , 64 @ 84 T 104 h 124 |
45 - 65 A 85 U 105 i 125 }
46 . 66 B 86 V 106 j 126 ~
47 / 67 C 87 W 107 k
48 0 68 D 88 X 108 l
49 1 69 E 89 Y 109 m
50 2 70 F 90 Z 110 n
51 3 71 G 91 [ 111 o
52 4 72 H 92 \ 112 p
These mappings are arbitrary; all that matters is that my computer and yours agree on them. The mapping we use for the Latin alphabet derives from a specification known as ASCII.
Today, most of the time we use Unicode, which contains encodings for characters in virtually every language, as well as emojis.
Every symbol has a mapping. We can use `ord()` to convert a character back to its numeric form; it is the inverse of `chr()`.
```python
chars = ["漢", "👻", "֎"]
for char in chars:
    print(char, ord(char))
```
漢 28450
👻 128123
֎ 1422
We call these mappings *encodings*: rules for how we interpret bytes as characters.
This is an important principle: data on its own is not enough to be interpreted; we also need to know the encoding.
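To illustrate, the same character can map to different byte sequences depending on the encoding, and decoding bytes with the wrong encoding produces the wrong text (a small sketch using Python's built-in `encode`/`decode`):

```python
char = "é"
print(char.encode("utf-8"))    # b'\xc3\xa9'  (two bytes)
print(char.encode("latin-1"))  # b'\xe9'      (one byte)

# decoding UTF-8 bytes as Latin-1 produces garbage
print(b"\xc3\xa9".decode("latin-1"))  # Ã©
```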
To demonstrate further, let’s consider a run of 4 bytes (32 bits) 0110 0001 0110 0001 0110 0001 0110 0001:
As an integer, this is the number 1,633,771,873.
Interpreted as a 32-bit float, it is 259,845,894,142,441,816,064 (about 2.6 × 10²⁰).
Interpreted as 4 separate bytes, it is 97 97 97 97.
These four bytes might mean "aaaa", or in an image format, might represent a single pixel (Red=97, Green=97, Blue=97, Alpha=97).
Ultimately, the bits alone are not enough; we also need to know how the data is meant to be interpreted.
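We can see all of these interpretations of the same four bytes side by side with the standard-library `struct` module (a sketch; `>I` and `>f` are big-endian unsigned-int and float formats):

```python
import struct

raw = bytes([0b01100001] * 4)  # four copies of 0110 0001

# as a 32-bit unsigned integer (big-endian)
print(struct.unpack(">I", raw)[0])   # 1633771873
# as a 32-bit float
print(struct.unpack(">f", raw)[0])   # roughly 2.6e+20
# as ASCII text
print(raw.decode("ascii"))           # aaaa
```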
Text Files
When you open a .py, .r, or .md file, the data itself is stored in a text encoding such as UTF-8 (a Unicode encoding). These are all colloquially known as “text files”.
Text files can be edited by a wide variety of editors: simple ones like Windows Notepad, programmer-focused terminal-based editors like neovim, or highly-customizable GUIs like Visual Studio Code or Obsidian.
File Extensions
You may wonder how file extensions (the suffix after the dot: .png, .py, .xlsx) interact with these concepts.
The data within a file is in no way impacted by the extension.
Extensions are primarily for human identification of the type of file. That said, some programs may get confused if you modify the extension.
This means if you have image.png and rename it to image.jpg it may still open in your image editor, but the image did not suddenly become a JPG.
To help understand this concept, imagine taking a hello.py file and renaming it hello.r. It wouldn’t magically become R code. The lack of a correct extension may confuse users and programs, but it does not affect the data or its encoding.
Serialization
If we want to store data for later retrieval or transmission we need to agree on a format.
If I wanted to represent users in my system I could store their data in files like:
people.txt
Adam is 41 he has 1 cats 0 dogs 1 fish 0 snakes
Kate is 26 she has 0 cats 2 dogs 0 fish 0 snakes
If every person were stored in the same format, we could write a program to interpret this, but it is not convenient. Any change to the format would require updating all the files and the parser code.
The generic term for what we’re doing here is serialization.
Serialization is the process of converting a data type in a given language, to a representation that can be shared between programs.
Whenever you are writing to disk, or sending data over a network, you need to serialize the outgoing data.
Deserialization is the inverse, taking data received over the network and converting it back to a representation you can work with.
A format like CSV (comma-separated values) represents the same data more concisely and consistently: one record per line, with fields separated by commas.
We call CSV a machine-readable format because it is meant to be interpreted by code. The fact that in this case we can also read it is a bonus; the primary consumer will be code that we write.
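As a concrete sketch, here is the people data from above written and read back with Python's built-in `csv` module (the column names are my own illustration, not part of the original file):

```python
import csv
import io

# the people.txt records, restructured as rows
rows = [
    ["name", "age", "cats", "dogs", "fish", "snakes"],
    ["Adam", 41, 1, 0, 1, 0],
    ["Kate", 26, 0, 2, 0, 0],
]

# serialize: write the rows as comma-separated lines
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())

# deserialize: parse the text back into lists of strings
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
print(parsed[1])  # ['Adam', '41', '1', '0', '1', '0']
```

Note that everything comes back as strings; CSV itself carries no type information, so converting `"41"` back to an integer is up to us.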
The json module provides a serialization/deserialization interface via its load/loads and dump/dumps functions.
```python
import json

data = {"this": "is some object", "with": ["multiple", "fields"]}

# serialization: object -> str
as_str = json.dumps(data)
print("converted to string", repr(as_str))
# we could also use json.dump, which writes to a file

# deserialization: str -> dict|list
back_to_obj = json.loads(as_str)
# json.load would instead read from a file
print(type(back_to_obj), back_to_obj)
```
converted to string '{"this": "is some object", "with": ["multiple", "fields"]}'
<class 'dict'> {'this': 'is some object', 'with': ['multiple', 'fields']}
JSON Considerations
- Easy to parse, widely used.
- Quite verbose; every field name gets repeated.
- Need to define a serialization format for non-standard types.
- Must be well-formed; a single extra or missing character will cause an error.
| Python Type | JSON Type | Notes |
|---|---|---|
| `str` | string | Double quotes only. |
| `int`, `float` | number | |
| `dict` | object | In JSON, all keys must be strings. |
| `list` | array | |
| `bool` | bool | JSON `true`, `false` are lowercase. |
| `None` | null | Different name, same purpose. |
JSON does not natively support other types. If you have (e.g.) datetime data, you will need to convert it to types JSON can understand (e.g. “2023-01-01” as a string).
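One common workaround is to supply a fallback converter when serializing; a sketch using the standard-library `datetime` (passing `default=str` is a convention, not a JSON feature, and the reverse conversion is entirely up to us):

```python
import json
from datetime import date

record = {"name": "Adam", "joined": date(2023, 1, 1)}

# json.dumps raises TypeError on date objects unless we
# provide a fallback converter for unsupported types
as_str = json.dumps(record, default=str)
print(as_str)  # {"name": "Adam", "joined": "2023-01-01"}

# on the way back in, we must convert the string ourselves
back = json.loads(as_str)
back["joined"] = date.fromisoformat(back["joined"])
print(back)
```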
XML
XML is made up of arbitrary tags that can be tailored to the data at hand. Typically the set of valid tags & how they nest, etc. is provided in the form of an XML Schema.
One advantage of XML is that it is strict: you cannot omit tags, opening and closing tags must match, and you must follow the schema if one is provided.
It was the dominant format circa 2000-2010, but today it has largely been supplanted by JSON and is more commonly found in legacy or enterprise systems.
You will likely encounter it working with government data, and it is related to HTML which we’ll see shortly.
XML Considerations
- More complex to parse than JSON, but offers more options.
- Lots of tools available, including strict validation.
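As a sketch of what working with XML looks like, the standard-library `xml.etree.ElementTree` module can parse simple documents (the tags and attributes here are made up for illustration, not from any real schema):

```python
import xml.etree.ElementTree as ET

doc = """
<people>
  <person age="41"><name>Adam</name></person>
  <person age="26"><name>Kate</name></person>
</people>
"""

root = ET.fromstring(doc)
for person in root.findall("person"):
    # element text and attributes are both accessible
    print(person.find("name").text, person.get("age"))
```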
Binary File Formats
Text-based file formats are inefficient when exchanging large amounts of numeric data.
To demonstrate this let’s look at a simple text representation of a location record.
Imagine that a vehicle fleet was sending these records as they drove:
```python
record = '{"id":737894404660,"latitude":44.21191,"longitude":-87.58329,"altitude":14.9,"heading":27.5,"velocity":16,"status":1}'
len(record)
```
len(record)
117
117 characters would be between 117 and 468 bytes depending on the encoding.
That isn’t bad, but if this was a message being sent hundreds of times per minute, it would add up.
Text-based serialization treats everything as text, so the number 200 is stored as:
| Character | Decimal | Binary (32-bit Unicode) |
|---|---|---|
| '2' | 50 | 0000 0000 0000 0000 0000 0000 0011 0010 |
| '0' | 48 | 0000 0000 0000 0000 0000 0000 0011 0000 |
| '0' | 48 | 0000 0000 0000 0000 0000 0000 0011 0000 |
That’s 96 bits for “200”. Stored as an integer instead, it would fit in 8 bits: 1100 1000.
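We can verify the size difference directly (a sketch; UTF-32 is used here to match the 32-bit-per-character table above, and it prepends a 4-byte byte-order mark):

```python
import struct

# "200" as UTF-32 text: 4-byte BOM + 3 characters x 4 bytes each
as_text = "200".encode("utf-32")
print(len(as_text))  # 16

# 200 as a single unsigned byte
as_byte = struct.pack("B", 200)
print(len(as_byte))       # 1
print(f"{200:08b}")       # 11001000
```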
We can leverage this by writing binary data directly instead of using an intermediary text format.
You may recall that when we open a file, we can specify the mode.
open("filename", "r") opens for reading and open("filename", "w") opens for writing.
Similarly open("filename", "rb") opens for reading in binary mode and open("filename", "wb") opens for writing in binary mode.
Let’s see this in action:
```python
import struct

# [id, lat, lng, alt, heading, velocity, status]
data = [737894404660, 44.21191, -87.58329, 14.9, 27.5, 16, 1]

# We use the struct module to describe the order of data
# in a binary file.
#
# The format string "Qfffff?" means:
#   Q     - unsigned 64-bit integer
#   fffff - 5 floats
#   ?     - boolean
as_bytes = struct.pack("Qfffff?", *data)
print("As Bytes:", as_bytes)
bytes_size = len(as_bytes)

# open file in writable-binary mode (wb)
with open("file.data", "wb") as f:
    f.write(as_bytes)
print()
print(f"wrote {len(as_bytes)} bytes to file")

# convert byte representation to binary (ASCII) for display
print()
print("Binary:", ''.join([f'{b:#010b}'[2:] for b in as_bytes]))

# read from binary file
with open("file.data", "rb") as f:
    back_to_data = struct.unpack("Qfffff?", f.read())
print()
print(back_to_data)
```
As Bytes: b'4\x12\xef\xcd\xab\x00\x00\x00\xff\xd80B\xa5*\xaf\xc2ffnA\x00\x00\xdcA\x00\x00\x80A\x01'
wrote 29 bytes to file
Binary: 0011010000010010111011111100110110101011000000000000000000000000111111111101100000110000010000101010010100101010101011111100001001100110011001100110111001000001000000000000000011011100010000010000000000000000100000000100000100000001
(737894404660, 44.211910247802734, -87.58329010009766, 14.899999618530273, 27.5, 16.0, True)
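To put the two serializations side by side, here is a sketch re-creating the record from the earlier sections (`separators` strips the spaces `json.dumps` inserts by default, matching the 117-character string above):

```python
import json
import struct

record = {"id": 737894404660, "latitude": 44.21191, "longitude": -87.58329,
          "altitude": 14.9, "heading": 27.5, "velocity": 16, "status": 1}

# text serialization: compact JSON, encoded as UTF-8
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

# binary serialization: the same fields packed with struct
as_struct = struct.pack("Qfffff?", 737894404660, 44.21191, -87.58329,
                        14.9, 27.5, 16, 1)

print(len(as_json), len(as_struct))  # 117 29
```

Roughly a 4x saving per message, before any compression, at the cost of the reader needing to know the exact field layout in advance.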
Binary File Considerations
- Binary file formats are typically much more efficient storage-wise, especially as data grows.
- Binary file formats typically need to embed version information: if fields change, parsers need to know.
- Text-based formats can be opened in any editor and edited by users easily (which may be good or bad).
- Text-based formats are typically more flexible by design (depending on encoding & format).
Common Binary File Formats:
PDF, DOC, XLS - Documents
JPG, PNG, BMP - Images
ZIP, RAR, TAR, GZ - Archives
MP3, AAC - Audio
MP4, MOV, WebM - Video
Final Thoughts
There’s no singular perfect format. The right choice for publishing data depends on what data you have and your intended audience’s capabilities.
As a consumer, you typically don’t get to pick what data format you get, so it is important to find tools that handle whatever format you encounter. No matter the format, you need to know what the data represents & how it is formatted.
In general, the more human-readable the format is, the more expensive it is to parse.