5  Data Formats


Goals


How is data represented on a computer?

You may be aware that at the lowest level, computers represent data in binary.

We can consider a single byte (eight bits): 0110 0001

Each bit corresponds to a power of two.

128 64 32 16 8 4 2 1
0 1 1 0 0 0 0 1

So 0110 0001 represents 64+32+1, or 97.
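
The same arithmetic can be checked in Python. This is a small sketch; the variable names are only illustrative.

```python
bits = "01100001"

# Add up the power of two for every position that holds a 1.
# The leftmost bit is worth 2**7 (128), the rightmost 2**0 (1).
total = sum(2 ** (7 - i) for i, bit in enumerate(bits) if bit == "1")
print(total)          # 97

# Python's built-ins perform the same conversion in both directions.
print(int(bits, 2))   # 97
print(f"{97:08b}")    # 01100001
```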

This works for representing integers, but what about strings?

Those same bits 0110 0001 can also mean the character a.

We can use Python’s built-in chr to convert an integer to its text representation:

for x in range(32, 53):
    if x < 47:
        print(f"{x:<3} {chr(x):<3}      {x+20:<3} {chr(x+20):<3}      {x+40:<3} {chr(x+40):<3}      {x+60:<3} {chr(x+60):<3}      {x+80:<3} {chr(x+80):<3}")
    else:
        print(f"{x:<3} {chr(x):<3}      {x+20:<3} {chr(x+20):<3}      {x+40:<3} {chr(x+40):<3}      {x+60:<3} {chr(x+60):<3}")
32           52  4        72  H        92  \        112 p  
33  !        53  5        73  I        93  ]        113 q  
34  "        54  6        74  J        94  ^        114 r  
35  #        55  7        75  K        95  _        115 s  
36  $        56  8        76  L        96  `        116 t  
37  %        57  9        77  M        97  a        117 u  
38  &        58  :        78  N        98  b        118 v  
39  '        59  ;        79  O        99  c        119 w  
40  (        60  <        80  P        100 d        120 x  
41  )        61  =        81  Q        101 e        121 y  
42  *        62  >        82  R        102 f        122 z  
43  +        63  ?        83  S        103 g        123 {  
44  ,        64  @        84  T        104 h        124 |  
45  -        65  A        85  U        105 i        125 }  
46  .        66  B        86  V        106 j        126 ~  
47  /        67  C        87  W        107 k  
48  0        68  D        88  X        108 l  
49  1        69  E        89  Y        109 m  
50  2        70  F        90  Z        110 n  
51  3        71  G        91  [        111 o  
52  4        72  H        92  \        112 p  

These mappings are arbitrary; all that matters is that there is agreement between my computer and yours. The mapping we use for the Latin alphabet is derived from a specification known as ASCII.

Today, most of the time we use Unicode, which contains encodings for characters in virtually every language, as well as emojis.

Any symbol has a mapping. We can use ord() to convert a character back to its numeric form, the inverse of chr().

chars = ["漢", "👻", "֎"]
for char in chars:
    print(char, ord(char))
漢 28450
👻 128123
֎ 1422

We call these mappings encodings: an agreement on how to interpret one or more bytes as a character.

This is an important principle: data on its own is not enough to be interpreted. We also need to know the encoding.
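
To see why the encoding matters, here is a minimal sketch that decodes the same bytes two different ways. The sample word and the choice of UTF-8 and Latin-1 are assumptions made for illustration.

```python
text = "café"

# Encoding turns characters into bytes; "é" needs two bytes in UTF-8.
raw = text.encode("utf-8")
print(raw)                      # b'caf\xc3\xa9'

# Decoding with the encoding used to write the bytes recovers the text.
print(raw.decode("utf-8"))      # café

# Decoding the same bytes with a different encoding gives different text.
print(raw.decode("latin-1"))    # cafÃ©
```

The bytes never changed; only the agreement about how to read them did.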

To demonstrate further, let’s consider a run of 4 bytes (32 bits) 0110 0001 0110 0001 0110 0001 0110 0001:

As an integer, this is the number 1,633,771,873.

Interpreted as a float it is 259845894142441816064.

Interpreted as 4 separate bytes, it is 97 97 97 97.

These four bytes might mean "aaaa", or in an image format, might represent a single pixel (Red=97, Green=97, Blue=97, Alpha=97).

Ultimately, the bits alone are not enough: we need to know how the data is meant to be interpreted.
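
The interpretations above can be reproduced with a short sketch. It assumes big-endian byte order (which does not matter here, since all four bytes are identical) and uses the standard library struct module for the float.

```python
import struct

raw = bytes([0b01100001] * 4)   # the byte 0110 0001, four times

# One 32-bit unsigned integer.
as_int = int.from_bytes(raw, "big")
print(as_int)                   # 1633771873

# One 32-bit IEEE 754 float.
(as_float,) = struct.unpack(">f", raw)
print(f"{as_float:.0f}")        # 259845894142441816064

# Four separate 8-bit integers.
print(list(raw))                # [97, 97, 97, 97]

# Four ASCII characters.
print(raw.decode("ascii"))      # aaaa
```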

Text Files

When you open a .py, .r, or .md file, the data itself is text stored in an encoding such as UTF-8 (the most common Unicode encoding). These are all colloquially known as “text files”.

Text files can be edited by a wide variety of editors: simple ones like Windows Notepad, programmer-focused terminal-based editors like neovim, or highly-customizable GUIs like Visual Studio Code or Obsidian.

File Extensions

You may wonder how file extensions (the suffix after the .: .png, .py, .xlsx) interact with these concepts.

The data within a file is in no way impacted by the extension.

Extensions are primarily for human identification of the type of file. That said, some programs may get confused if you modify the extension.

This means if you have image.png and rename it to image.jpg it may still open in your image editor, but the image did not suddenly become a JPG.

To help understand this concept, imagine taking a hello.py file and renaming it hello.r. It wouldn’t magically become R code. An incorrect extension may confuse users or programs, but it does not affect the data or its encoding.
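
A quick way to convince yourself is to compare the bytes before and after a rename. This is a sketch that works in a temporary directory; the file contents are made up.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "hello.py"
    original.write_bytes(b'print("hello")\n')
    before = original.read_bytes()

    # Rename hello.py to hello.r; only the name changes.
    renamed = original.rename(Path(tmp) / "hello.r")
    after = renamed.read_bytes()

print(before == after)   # True
print(after)             # b'print("hello")\n'
```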

Serialization

If we want to store data for later retrieval or transmission we need to agree on a format.

If I wanted to represent users in my system I could store their data in files like:

people.txt

Adam is 41 he has 1 cats 0 dogs 1 fish 0 snakes
Kate is 26 she has 0 cats 2 dogs 0 fish 0 snakes

If every person were stored in the same format, we could write a program to interpret this, but it would not be convenient. Any change to the format would require updating all of the files and the parser code.
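
To get a feel for the inconvenience, here is one possible parser for the people.txt layout above. The pattern and the parse_person helper are hypothetical; they only work if every line follows the exact wording shown.

```python
import re

# Matches exactly the sentence layout used in people.txt.
LINE_PATTERN = re.compile(
    r"(?P<name>\w+) is (?P<age>\d+) \w+ has "
    r"(?P<cats>\d+) cats (?P<dogs>\d+) dogs "
    r"(?P<fish>\d+) fish (?P<snakes>\d+) snakes"
)

def parse_person(line):
    """Turn one line of people.txt into a dict, or raise ValueError."""
    match = LINE_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    # Every field except the name is a whole number.
    return {
        key: (value if key == "name" else int(value))
        for key, value in match.groupdict().items()
    }

lines = [
    "Adam is 41 he has 1 cats 0 dogs 1 fish 0 snakes",
    "Kate is 26 she has 0 cats 2 dogs 0 fish 0 snakes",
]
people = [parse_person(line) for line in lines]
print(people[1]["dogs"])   # 2
```

Adding a new kind of pet, or rewording the sentence, would break both the pattern and every existing file.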

The generic term for what we’re doing here is serialization.

Serialization is the process of converting data from its in-memory form in a given language to a representation that can be stored or shared between programs.

Whenever you are writing to disk, or sending data over a network, you need to serialize the outgoing data.

Deserialization is the inverse: taking data read from disk or received over the network and converting it back to a representation you can work with.

CSV

We might instead choose to use a CSV file:

name,age,gender,cats,dogs,fish,snakes
Adam,41,M,1,0,1,0
Kate,26,F,0,2,0,0

This represents the same data, but in a more concise and consistent format.

We call CSV a machine-readable format because it is meant to be interpreted by code. The fact that we can also read it in this case is a bonus; the primary consumer will be code that we write.

import csv

data = [
    { 'name': 'Adam', 'age': '41', 'gender': 'M', 'cats': '1', 'dogs': '0', 'fish': '1', 'snakes': '0' },
    { 'name': 'Kate', 'age': '26', 'gender': 'F', 'cats': '0', 'dogs': '2', 'fish': '0', 'snakes': '0' }
]

# serialize: write to disk using csv.DictWriter
# (the csv docs recommend newline='' so line endings are not doubled)
with open('pets.csv', 'w', newline='') as file:
    fieldnames = data[0].keys()
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data)

# deserialize using csv.DictReader
with open('pets.csv', 'r', newline='') as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row)
{'name': 'Adam', 'age': '41', 'gender': 'M', 'cats': '1', 'dogs': '0', 'fish': '1', 'snakes': '0'}
{'name': 'Kate', 'age': '26', 'gender': 'F', 'cats': '0', 'dogs': '2', 'fish': '0', 'snakes': '0'}
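
Notice in the output above that every value, including the age, comes back as a string: CSV carries no type information. A minimal sketch of converting the numeric columns after reading, assuming we know in advance which fields hold integers (NUMERIC_FIELDS is a name made up for this example, and the data is read from a string rather than a file):

```python
import csv
import io

csv_text = (
    "name,age,gender,cats,dogs,fish,snakes\n"
    "Adam,41,M,1,0,1,0\n"
    "Kate,26,F,0,2,0,0\n"
)

NUMERIC_FIELDS = ("age", "cats", "dogs", "fish", "snakes")

people = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Convert the columns we know to be whole numbers.
    for field in NUMERIC_FIELDS:
        row[field] = int(row[field])
    people.append(row)

print(people[1]["dogs"] + 1)   # 3
```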

CSV Considerations

  • Ideal for many records with the same fields.
  • If the comma character is inconvenient, any other character can serve as the delimiter (\t and | are common choices).
  • Does not handle hierarchical/nested data well.
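
As a sketch of the delimiter point, the csv module accepts a delimiter argument on both its writer and reader. The sample rows are made up, and the data is written to an in-memory buffer instead of a file.

```python
import csv
import io

rows = [["name", "motto"], ["Adam", "cats, fish, and more"]]

# Default comma delimiter: the field that contains commas gets quoted.
comma_buffer = io.StringIO()
csv.writer(comma_buffer).writerows(rows)
print(comma_buffer.getvalue())

# Tab delimiter: no quoting is needed for this data.
tab_buffer = io.StringIO()
csv.writer(tab_buffer, delimiter="\t").writerows(rows)
print(tab_buffer.getvalue())

# Reading back requires passing the same delimiter.
tab_buffer.seek(0)
round_trip = list(csv.reader(tab_buffer, delimiter="\t"))
print(round_trip == rows)   # True
```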

JSON

JSON is a data format created for data interchange over the internet, and it has become the most common format for web APIs.

The format resembles how JavaScript (and Python) represent data and should look familiar:

[
  {
    "name": "Adam",
    "age": 41,
    "pets": {"snakes": 0, "cats": 1, "dogs": 0, "fish": 1}
  },
  {
    "name": "Kate",
    "age": 26,
    "pets": {"snakes": 0, "cats": 0, "dogs": 2, "fish": 0}
  }
]

The json module provides a serialization/deserialization interface via its load/loads and dump/dumps functions.

import json
data = {
  "this": "is some object", 
  "with": ["multiple", "fields"]
}

# serialization: object -> str
as_str = json.dumps(data)
print("converted to string", repr(as_str))
# we could also use json.dump, which writes to a file

# deserialization: str -> dict|list
back_to_obj = json.loads(as_str)
# json.load would instead read from a file
print(type(back_to_obj), back_to_obj)
converted to string '{"this": "is some object", "with": ["multiple", "fields"]}'
<class 'dict'> {'this': 'is some object', 'with': ['multiple', 'fields']}

JSON Considerations

  • Easy to parse, widely used.
  • Quite verbose, every field name gets repeated.
  • Need to define serialization format for non-standard types.
  • Must be well-formed: a single extra or missing character will cause an error.

Python Type    JSON Type    Notes
str            string       “double quotes only”
int, float     number
dict           object       In JSON all keys must be strings.
list           array
bool           boolean      JSON true, false are lower case.
None           null         Different name, same purpose.
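
The mapping can be observed directly with json.dumps. This is a small sketch with made-up values; note how the integer key is turned into a string.

```python
import json

python_value = {
    "name": "Adam",
    "age": 41,
    "height": 1.8,
    "active": True,
    "nickname": None,
    "pets": ["cat", "fish"],
    1: "integer key",
}

as_json = json.dumps(python_value)
print(as_json)

# Keys are always strings after a round trip through JSON.
round_trip = json.loads(as_json)
print(round_trip["1"])        # integer key
print(1 in round_trip)        # False
```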

JSON does not natively support other types. If you have (e.g.) date or time data, you will need to convert it to types JSON can understand (e.g. “2023-01-01” as a string).
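
One common convention, sketched below, is to store dates as ISO 8601 strings. The field name joined is made up, and the reader is assumed to know which fields hold dates.

```python
import json
from datetime import date

record = {"name": "Adam", "joined": date(2023, 1, 1)}

# json.dumps(record) would raise TypeError, because date is not a JSON type.
# Store the ISO 8601 string instead.
serializable = {**record, "joined": record["joined"].isoformat()}
as_str = json.dumps(serializable)
print(as_str)   # {"name": "Adam", "joined": "2023-01-01"}

# The reader has to know that "joined" holds a date and convert it back.
loaded = json.loads(as_str)
loaded["joined"] = date.fromisoformat(loaded["joined"])
print(loaded == record)   # True
```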

XML

XML is made up of arbitrary tags that can be tailored to the data at hand. Typically the set of valid tags & how they nest, etc. is provided in the form of an XML Schema.

One advantage of XML is that it is strict: you cannot omit tags, every opening tag must have a matching closing tag, and the document must follow the schema if one is provided.

It was the dominant format circa 2000-2010, but today it has mostly been supplanted by JSON and is more commonly found in legacy or enterprise systems.

You will likely encounter it working with government data, and it is related to HTML which we’ll see shortly.

<?xml version="1.0" encoding="UTF-8"?>
<people>
    <person>
        <name age="41">Adam</name>
        <pets>
            <snakes>0</snakes>
            <cats>1</cats>
            <dogs>0</dogs>
            <fish>1</fish>
        </pets>
    </person>
    <person>
        <name age="26">Kate</name>
        <pets>
            <snakes>0</snakes>
            <cats>0</cats>
            <dogs>2</dogs>
            <fish>0</fish>
        </pets>
    </person>
</people>
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<people>
    <person>
        <name age="41">Adam</name>
        <pets>
            <snakes>0</snakes>
            <cats>1</cats>
            <dogs>0</dogs>
            <fish>1</fish>
        </pets>
    </person>
    <person>
        <name age="26">Kate</name>
        <pets>
            <snakes>0</snakes>
            <cats>0</cats>
            <dogs>2</dogs>
            <fish>0</fish>
        </pets>
    </person>
</people>"""

import lxml.etree

root = lxml.etree.fromstring(xml)
for entry in root.xpath("//person/name"):
    print(entry.text, entry.get("age"))
Adam 41
Kate 26

XML Considerations

  • More complex to parse than JSON, but offers more options.
  • Lots of tools available, including strict validation.
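
As a small illustration of that strictness, a parser rejects a document with a mismatched tag outright. This sketch uses the standard library xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; the sample strings are made up.

```python
import xml.etree.ElementTree as ET

well_formed = "<person><name age='41'>Adam</name></person>"
broken = "<person><name age='41'>Adam</person>"   # <name> is never closed

root = ET.fromstring(well_formed)
print(root.find("name").text)   # Adam

try:
    ET.fromstring(broken)
    parsed_broken = True
except ET.ParseError as error:
    parsed_broken = False
    print("rejected:", error)
```

This only checks that the document is well-formed. Validating against an XML Schema is a separate step that needs additional tooling.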

Binary File Formats

Text-based file formats are inefficient when exchanging large amounts of numeric data.

To demonstrate this let’s look at a simple text representation of a location record.

Imagine that a vehicle fleet was sending these records as they drove:

record = '{"id":737894404660,"latitude":44.21191,"longitude":-87.58329,"altitude":14.9,"heading":27.5,"velocity":16,"status":1}'
len(record)
117

117 characters would be between 117 and 468 bytes depending on the encoding.
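
The two ends of that range can be checked by encoding the record. This sketch uses the -le variants of UTF-16 and UTF-32 so that no byte order mark is added to the count.

```python
record = '{"id":737894404660,"latitude":44.21191,"longitude":-87.58329,"altitude":14.9,"heading":27.5,"velocity":16,"status":1}'

sizes = {
    encoding: len(record.encode(encoding))
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}
for encoding, size in sizes.items():
    print(encoding, size)
# utf-8 117
# utf-16-le 234
# utf-32-le 468
```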

That isn’t bad, but if this was a message being sent hundreds of times per minute, it would add up.

Text-based formats store everything as characters, so the number 200 is stored as:

Character    Decimal    Binary (32-bit Unicode)
'2'          50         0000 0000 0000 0000 0000 0000 0011 0010
'0'          48         0000 0000 0000 0000 0000 0000 0011 0000
'0'          48         0000 0000 0000 0000 0000 0000 0011 0000

That’s 96 bits for “200”. Stored as an integer, 200 fits in just 8 bits: 1100 1000.
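
A short sketch of that comparison, using UTF-32 for the text form (as in the table) and the struct module for the integer form:

```python
import struct

as_text = "200".encode("utf-32-be")   # three characters, 4 bytes each
as_int = struct.pack("B", 200)        # one unsigned 8-bit integer

print(len(as_text) * 8)      # 96
print(len(as_int) * 8)       # 8
print(f"{as_int[0]:08b}")    # 11001000
```

With UTF-8 the text form would shrink to 24 bits, which is still three times the size of the single byte.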

We can leverage this by writing binary data directly instead of using an intermediary text format.

You may recall that when we open a file, we can specify the mode.

open("filename", "r") opens for reading and open("filename", "w") opens for writing.

Similarly open("filename", "rb") opens for reading in binary mode and open("filename", "wb") opens for writing in binary mode.

Let’s see this in action:

import struct
# [id, lat, lng, alt, heading, velocity, status]
data = [737894404660, 44.21191, -87.58329, 14.9, 27.5, 16, 1]

# We use the struct module to describe the order and size
# of each value in a binary file.
#
# The format string "Qfffff?" means:
# Q - unsigned 64-bit integer
# fffff - 5 32-bit floats
# ? - boolean
as_bytes = struct.pack("Qfffff?", *data)

print("As Bytes:", as_bytes)

# open file in writable-binary mode (wb)
with open("file.data", "wb") as f:
    f.write(as_bytes)
print()
print(f"wrote {len(as_bytes)} bytes to file")

# convert each byte to its 8-bit binary form (as text) for display
print()
print("Binary:", ''.join(f'{b:08b}' for b in as_bytes))

# read from binary file
with open("file.data", "rb") as f:
    back_to_data = struct.unpack("Qfffff?", f.read())
    print()
    print(back_to_data)
As Bytes: b'4\x12\xef\xcd\xab\x00\x00\x00\xff\xd80B\xa5*\xaf\xc2ffnA\x00\x00\xdcA\x00\x00\x80A\x01'

wrote 29 bytes to file

Binary: 0011010000010010111011111100110110101011000000000000000000000000111111111101100000110000010000101010010100101010101011111100001001100110011001100110111001000001000000000000000011011100010000010000000000000000100000000100000100000001

(737894404660, 44.211910247802734, -87.58329010009766, 14.899999618530273, 27.5, 16.0, True)
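
Notice in the output above that 44.21191 came back as 44.211910247802734: a 32-bit float cannot hold the value exactly. The sketch below compares 32-bit and 64-bit floats. The "<" prefix is an addition not used in the example above; it requests little-endian byte order with no padding, so the layout is the same on every machine.

```python
import struct

data = [737894404660, 44.21191, -87.58329, 14.9, 27.5, 16, 1]

single = struct.pack("<Qfffff?", *data)   # 32-bit floats
double = struct.pack("<Qddddd?", *data)   # 64-bit floats

print(len(single), len(double))               # 29 49

latitude_single = struct.unpack("<Qfffff?", single)[1]
latitude_double = struct.unpack("<Qddddd?", double)[1]
print(latitude_single)   # 44.211910247802734
print(latitude_double)   # 44.21191
```

Even the larger layout is well under half the size of the 117-character text record.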

Binary File Considerations

  • Binary file formats are typically much more efficient storage-wise, especially as data grows.
  • Binary file formats typically need to embed version information: if the fields change, the parser needs to know which layout it is reading.
  • Text-based formats allow usage of any editor, can be edited by users easily (which may be good or bad).
  • Text-based formats are typically more flexible by design. (Depends on encoding & format.)
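
One way to embed version information is to put a version number in the first byte and have the reader check it before unpacking the rest. This is a sketch of a hypothetical layout, not an established format; FORMAT_VERSION, write_record, and read_record are names made up for the example.

```python
import struct

FORMAT_VERSION = 1       # hypothetical version number
HEADER = "<B"            # one unsigned byte holding the version
BODY_V1 = "<Qfffff?"     # id, five 32-bit floats, status flag

def write_record(values):
    """Pack a version byte followed by the record body."""
    return struct.pack(HEADER, FORMAT_VERSION) + struct.pack(BODY_V1, *values)

def read_record(raw):
    """Check the version byte, then unpack the body that follows it."""
    (version,) = struct.unpack_from(HEADER, raw, 0)
    if version != FORMAT_VERSION:
        raise ValueError(f"unsupported format version: {version}")
    return struct.unpack_from(BODY_V1, raw, struct.calcsize(HEADER))

raw = write_record([737894404660, 44.21191, -87.58329, 14.9, 27.5, 16, 1])
print(len(raw))              # 30
print(read_record(raw)[0])   # 737894404660
```

If a later version added a field, the reader could choose a different body layout based on the version byte instead of misreading the data.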

Common Binary File Formats:

  • PDF, DOC, XLS - Documents
  • JPG, PNG, BMP - Images
  • ZIP, RAR, TAR, GZ - Archives
  • MP3, AAC - Audio
  • MP4, MOV, WebM - Video

Final Thoughts

There’s no singular perfect format. The right choice for publishing data depends on what data you have and your intended audience’s capabilities.

As a consumer, you typically don’t get to pick what data format you get, so it’ll be important to find tools to handle any format you encounter. No matter the format, you need to know what the data represents and how it is formatted.

In general, the more human-readable the format is, the more expensive it is to parse.

Further Exploration

For more details on JSON and other file formats:

Some helpful Python module documentation:

  • pathlib Appendix - Notes on using the pathlib module.
  • json - Python standard library module for parsing JSON.
  • csv - Python standard library module for parsing CSV.
  • yaml - Python library for parsing YAML, a common configuration format.

  1. See https://fabiensanglard.net/floating_point_visually_explained/ to understand how floats are stored.