Read PostgreSQL Table Files with Go

21 Dec 2020

Preamble

This exercise got away from me and proved to be a fun learning experience in itself. I was going to make a PostgreSQL table with a few values and then write some Go code to parse that table: a little exercise so that I could solidify my skills with Go's standard library (around io and parsing binary data) and learn about PostgreSQL's on-disk format along the way.

Well, once you go down the PostgreSQL disk layout rabbit hole, you'll find yourself spending way more time on Google than you thought you would, and quality time poring over source code, and discovering that in-memory representations of PostgreSQL data don't always match on-disk representations of PostgreSQL data. You'll find yourself using hexdump and learning about Postgres's varlenas (variable-length arrays) and how shorter varlenas have a different on-disk representation than longer varlenas.

You'll find yourself reading Stack Overflow answers where you'll see people incorrectly assert that a tuple header is 24 bytes, but you'll know it's 23 bytes with perhaps a byte of padding before the first attribute's data starts; and that this is for a tuple with no null columns and no oid. You'll wonder how much you need to worry about MAXALIGN.

Needless to say, this blog entry only ended up scratching the surface of parsing PostgreSQL's table files using Go. But it was a fun (and eye-opening) exercise. The code here follows the dictum "first make it work, then make it pretty" in that I have not even bothered with the pretty part; as I iterate through the Go code showing how I built up my knowledge, I keep everything in the main() function and leave it at that.

The Blog Entry

Let's create a table in PostgreSQL and load it with a few values.

# create table t (
    id bigint constraint t_pk primary key not null,
    name text not null);
CREATE TABLE
# commit;
COMMIT

# insert into t (id, name)
    values
    (1, 'Alice'),
    (2, 'Cheshire Cat'),
    (3, 'Red Queen'),
    (4, 'White Rabbit');
INSERT 0 4
# commit;
COMMIT

Now let's find out where our table lives on disk.

# select pg_relation_filepath('t'); rollback;
┌──────────────────────┐
│ pg_relation_filepath │
├──────────────────────┤
│ base/12709/16567     │
└──────────────────────┘

# show data_directory; rollback;
┌─────────────────────────────────┐
│         data_directory          │
├─────────────────────────────────┤
│ /usr/local/postgresql-12.3/data │
└─────────────────────────────────┘

To ease file permission troubles, let's make a copy of that file and set its permissions so that we can play with it.

$ su -
# cp /usr/local/postgresql-12.3/data/base/12709/16567 /home/mwood
# chown mwood:mwood /home/mwood/16567 
# exit

The PostgreSQL docs have this page on database page layout. This is already interesting. It looks like a table file is a series of one or more database pages. Each page is 8k. The table we have been playing with only has 4 rows, and those rows should take up far less than 8k of storage, but if we look at the size of our database file, it is indeed exactly 8k, meaning there's a lot of empty space in this file! It also means that the docs aren't lying when they say a database page is 8k. (This should also mean that once we overflow our first 8k page, our database file will become 8k larger: another database page will get added to our file.)

$ ls -lh 16567 
-rw------- 1 mwood mwood 8.0K Dec 20 22:18 16567

So there it is: a PostgreSQL table file is an array of one or more 8 kB pages. Table files grow as large as 1 GB and then spill over into a new file that, in turn, is allowed to get as large as 1 GB before spilling over to a new file, etc.

database file:
┌───────────┬───────────┬───────────┬─       ─┬───────────┐
│ 8 kB page │ 8 kB page │ 8 kB page │   ...   │ 8 kB page │
└───────────┴───────────┴───────────┴─       ─┴───────────┘
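As a quick sanity check, the number of pages in a table file falls out of dividing its size by 8192. Here's a tiny sketch (the helper name is mine, not PostgreSQL's):

```go
package main

import "fmt"

// pageCount says how many 8 kB pages a table file of the given
// size holds; PostgreSQL heap pages are 8192 bytes each.
func pageCount(fileSize int64) int64 {
	const pageSize = 8192
	return fileSize / pageSize
}

func main() {
	fmt.Println(pageCount(8192))  // our 4-row table file: 1 page
	fmt.Println(pageCount(24576)) // a 24 kB file would hold 3 pages
}
```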

An 8 kB page consists of a header, followed by an array of ItemIdData "pointers", followed by empty space. Rows (or tuples) are added backwards from the end of the page, growing toward the ItemIdData "pointers", until there is too little room left between the pointers and the tuples themselves; then, a new page is created.

8 kB page:
┌────────────────┬────────┬────────┬───────────────────────┐
│ PageHeaderData │ ItemId │ ItemId │ grows this way →      │
├────────────────┴────────┴────────┘                       │
│                                                          │
│                                                          │
│                                          ┌───────┬───────┤
│                         ← grows this way │ tuple │ tuple │
└──────────────────────────────────────────┴───────┴───────┘

Index pages have a special area at the end of each page, but we won't be worrying about that here, because we are just trying to learn about table pages, which don't have that special area.

The PostgreSQL docs say that the first 24 bytes of a database page (and therefore the first 24 bytes of our database file, which consists of only one 8k database page) are a page header. Moreover, the first 8 bytes of this header (and therefore of the file) are a field named pd_lsn of type PageXLogRecPtr (which is, in turn, two uint32s).

PageHeaderData: (one char width = 1 bit but pretend box drawing chars
use no space).
┌────────────────────────────────┬────────────────────────────────┐
│ PageXLogRecPtr:XLogID uint32   │ PageXLogRecPtr:XRecOff uint32  │
├────────────────┬───────────────┼────────────────┬───────────────┤
│ PdChecksum u16 │ PdFlags u16   │ PdLower uint16 │ PdUpper u16   │
├────────────────┼───────────────┼────────────────┴───────────────┤
│ PdSpecial u16  │ PdPgSzVer u16 │ PdPruneXID  uint32             │
└────────────────┴───────────────┴────────────────────────────────┘

/usr/local/src/postgresql-12.3/src/include/storage/bufpage.h:

typedef struct
{
        uint32          xlogid;                 /* high bits */
        uint32          xrecoff;                /* low bits */
} PageXLogRecPtr;

typedef struct PageHeaderData
{
        PageXLogRecPtr pd_lsn;
...

Let's just read the first 8 bytes of the table file and see if we can do that.

Go has an interface, io.Reader, that has a Read method:

type Reader interface {
    Read(p []byte) (n int, err error)
}

Fortunately, os.File satisfies that interface.

We could therefore read the first 4 bytes of our file into xlogid like so, and print out the contents:

package main

import (
  "fmt"
  "log"
  "os"
)

func main() {
  path := "/home/mwood/16567"

  file, err := os.Open(path)
  if err != nil {
    log.Fatalf("Could not open '%s': %v", path, err)
  }
  defer file.Close() // ignore possible close error

  xlogid := make([]byte, 4)

  n, err := file.Read(xlogid)

  if err != nil {
    log.Fatalf("Could not read '%s': %v", path, err)
  }

  fmt.Printf("Read %d bytes: %+v\n", n, xlogid)
}

Prints

Read 4 bytes: [0 0 0 0]

Of course, we know the first 4 bytes of the table file are actually a uint32, so why not read those bytes as a unint32 rather than as a slice of 4 bytes? Go's encoding/binary library lets us do that. It's even sophisticated enough for us to let it know the endianness of the number so that it knows how to turn the bytes into a uint32.

This time, let's read in the first 4 bytes of the table file and the next 4 bytes of the table file, because we know that both of those are little-endian uint32s.

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
)

func main() {
	path := "/home/mwood/16567"

	file, err := os.Open(path)
	if err != nil {
		log.Fatalf("Could not open '%s': %v", path, err)
	}
	defer file.Close() // ignore possible close error

	xlogid := make([]byte, 4)

	n, err := file.Read(xlogid)
	if err != nil {
		log.Fatalf("Could not read xlogid from '%s': %v", path, err)
	}

	fmt.Printf("Read %d bytes for xlogid: %+v\n", n, xlogid)

	xrecoff := make([]byte, 4)

	n, err = file.Read(xrecoff)
	if err != nil {
		log.Fatalf("Could not read xrecoff from '%s': %v", path, err)
	}

	fmt.Printf("Read %d bytes for xrecoff: %+v\n", n, xrecoff)

	xrecoff32 := binary.LittleEndian.Uint32(xrecoff)
	fmt.Printf("xrecoff: %d\n", xrecoff32)
}

Prints

Read 4 bytes for xlogid: [0 0 0 0]
Read 4 bytes for xrecoff: [184 195 117 1]
xrecoff: 24495032

The Go standard library's encoding/binary can do way more than that. Given a struct representing the header of the table file, it can correctly slurp the first 24 bytes of the table file into that struct!

Let's take a closer look at the PageHeaderData struct in the PostgreSQL source code's /usr/local/src/postgresql-12.3/src/include/storage/bufpage.h:

typedef struct PageHeaderData
{
        PageXLogRecPtr  pd_lsn;
        uint16          pd_checksum;
        uint16          pd_flags;
        LocationIndex   pd_lower;
        LocationIndex   pd_upper;
        LocationIndex   pd_special;
        uint16          pd_pagesize_version;
        TransactionId   pd_prune_xid;
        ItemIdData      pd_linp[FLEXIBLE_ARRAY_MEMBER];
} PageHeaderData;

We already saw the definition of PageXLogRecPtr (it's two uint32s). Further inspection of /usr/local/src/postgresql-12.3/src/include/storage/bufpage.h shows that LocationIndex is a uint16, and TransactionId is a uint32. ItemIdData we will ignore for now and deal with later.

Our Go equivalent of PostgreSQL's PageHeaderData struct will therefore look like this:

type PageHeader struct {
  XLogID            uint32
  XRecOff           uint32
  PdChecksum        uint16
  PdFlags           uint16
  PdLower           uint16
  PdUpper           uint16
  PdSpecial         uint16
  PdPagesizeVersion uint16
  PdPruneXID        uint32
}

And here is our program that slurps the first 24 bytes of our database file into the PageHeader struct:

package main
  
import (
  "encoding/binary"
  "fmt"
  "log"
  "os"
)

// PageHeader is based on pg's PageHeaderData struct
// from src/include/storage/bufpage.h.
type PageHeader struct {
	XLogID            uint32
	XRecOff           uint32
	PdChecksum        uint16
	PdFlags           uint16
	PdLower           uint16
	PdUpper           uint16
	PdSpecial         uint16
	PdPagesizeVersion uint16
	PdPruneXID        uint32
}

func main() {
	path := "/home/mwood/16567"

	file, err := os.Open(path)
	if err != nil {
		log.Fatalf("Could not open '%s': %v", path, err)
	}
	defer file.Close() // ignore possible close error

	header := &PageHeader{}
	err = binary.Read(file, binary.LittleEndian, header)
	if err != nil {
		log.Fatalf("Could not read page header: %v", err)
	}

	fmt.Printf("page header: %+v\n", header)
}

Prints

page header: &{XLogID:0 XRecOff:24495032 PdChecksum:0 PdFlags:0 PdLower:40 PdUpper:8008 PdSpecial:8192 PdPagesizeVersion:8196 PdPruneXID:0}

That worked!

A few notes about the latest iteration of our program: the docs for Go's encoding/binary library say that the first arg of binary.Read() is of type io.Reader! And os.File satisfies the io.Reader interface, which is why we can feed our file to binary.Read(). Handy!

After the page header come the ItemIdData pointers. Each one takes 4 bytes. According to the docs:

"Following the page header are item identifiers (ItemIdData), each requiring four bytes. ...

"The number of item identifiers present can be determined by looking at pd_lower, which is increased to allocate a new identifier."

So if we take the size of the PageHeader (24 bytes) and subtract it from PdLower in our example (40), we get 16. Each ItemIdData is 4 bytes, meaning we have 4 of them. Which is absolutely correct, because we entered 4 rows into our table above!
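That arithmetic is simple enough to capture in a helper (a sketch; the function name is mine):

```go
package main

import "fmt"

// itemIDCount derives the number of ItemIdData pointers on a page
// from pd_lower: the space between the 24-byte page header and
// pd_lower is filled with 4-byte item identifiers.
func itemIDCount(pdLower uint16) int {
	const pageHeaderSize = 24
	const itemIDSize = 4
	return (int(pdLower) - pageHeaderSize) / itemIDSize
}

func main() {
	fmt.Println(itemIDCount(40)) // (40 - 24) / 4 = 4: our 4 rows
}
```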

Each ItemIdData holds three items: 1) the offset from the start of the table page until the start of the tuple, 2) the size in bytes of the tuple, and 3) two bits describing one of four possible states for the tuple (see src/include/storage/itemid.h for details).

These ItemIdData pointers turned out to be tricky! First off the representation of the ItemIdData pointer in RAM (from src/include/storage/itemid.h) is:

typedef struct ItemIdData
{
        unsigned   lp_off:15,              /* offset to tuple (from start of page) */
                   lp_flags:2,             /* state of line pointer, see below */
                   lp_len:15;              /* byte length of tuple */
} ItemIdData;

#define LP_UNUSED               0               /* unused (should always have lp_len=0) */
#define LP_NORMAL               1               /* used (should always have lp_len>0) */
#define LP_REDIRECT             2               /* HOT redirect (should have lp_len=0) */
#define LP_DEAD                 3               /* dead, may or may not have storage */

First, notice the bit fields packed into the C struct! We can't do those in Go. Even more interesting is that this seems NOT to be the on-disk layout! On disk (after messing around with a hex dumper and taking endianness into account), I convinced myself that the first two bytes are lp_off, but the high bit of lp_off is the low bit of lp_flags; the last two bytes are lp_len, but the low bit of lp_len is the high bit of lp_flags.

ItemIdData: (one char width = 1 bit but pretend box drawing chars
use no space).

 ┌─ low bit of LpFlags
 ↓
┌─┬───────────────┬───────────────┬─┐
│ │ LpOff uint15  │ LpLen uint15  │ │
└─┴───────────────┴───────────────┴─┘
                                   ↑
              high bit of LpFlags ─┘

Our program now looks like this:

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
)

// PageHeader is based on pg's PageHeaderData struct
// from src/include/storage/bufpage.h.
type PageHeader struct {
	XLogID            uint32
	XRecOff           uint32
	PdChecksum        uint16
	PdFlags           uint16
	PdLower           uint16
	PdUpper           uint16
	PdSpecial         uint16
	PdPagesizeVersion uint16
	PdPruneXID        uint32
}

type RawItemIDData struct {
	LpOff uint16
	LpLen uint16
}

type ItemIDData struct {
	LpOff   uint16
	LpLen   uint16
	LpFlags byte
}

const PageHeaderByteSize int = 24
const ItemIDByteSize int = 4

func main() {
	path := "/home/mwood/16567"

	file, err := os.Open(path)
	if err != nil {
		log.Fatalf("Could not open '%s': %v", path, err)
	}
	defer file.Close() // ignore possible close error

	header := &PageHeader{}
	err = binary.Read(file, binary.LittleEndian, header)
	if err != nil {
		log.Fatalf("Could not read page header: %v", err)
	}

	fmt.Printf("page header: %+v\n", header)

	itemIDDataSize := (int(header.PdLower) - PageHeaderByteSize) / ItemIDByteSize

	fmt.Printf("Number of ItemIDData structs: %d\n", itemIDDataSize)

	itemIDDataPointers := make([]ItemIDData, itemIDDataSize)

	for i := 0; i < itemIDDataSize; i++ {
		raw := &RawItemIDData{}
		err = binary.Read(file, binary.LittleEndian, raw)
		if err != nil {
			log.Fatalf("Could not read %dth ItemIDData: %v", i, err)
		}
		newItem := ItemIDData{
			LpOff: raw.LpOff & 0b_01111111_11111111,
			LpLen: raw.LpLen >> 1,
			LpFlags: byte(((raw.LpOff >> 15) & 0b_00000000_00000001) |
				((raw.LpLen << 1) & 0b_00000000_00000010)),
		}
		itemIDDataPointers[i] = newItem
	}

	for _, p := range itemIDDataPointers {
		fmt.Printf("item ID data pointer: %+v\n", p)
	}
}

Prints

page header: &{XLogID:0 XRecOff:24495032 PdChecksum:0 PdFlags:0 PdLower:40 PdUpper:8008 PdSpecial:8192 PdPagesizeVersion:8196 PdPruneXID:0}
Number of ItemIDData structs: 4
item ID data pointer: {LpOff:8152 LpLen:38 LpFlags:1}
item ID data pointer: {LpOff:8104 LpLen:45 LpFlags:1}
item ID data pointer: {LpOff:8056 LpLen:42 LpFlags:1}
item ID data pointer: {LpOff:8008 LpLen:45 LpFlags:1}

One thing to note is that each LpOff is counted from the very start of the 8k database page. So to get to each row of data, I could use os.File.Seek() to position myself in the file before doing my next binary.Read(), which has been serving me so well up until now. However, once my database file grows to more than one database page, os.File.Seek() won't be as convenient: it only measures from 1) the start of the file, 2) where I am in the file now, or 3) the end of the file, so every seek would also have to account for the page's own offset within the file. Why don't I instead read the entire 8k page into a slice of bytes and operate on that?

In this next iteration of our code, I have done just that, turning the contents of our 1-page file into a slice of bytes named pageBytes. I wrap pageBytes in a bytes.Reader named pageReader. bytes.Reader implements io.Reader, so the rest of our code can remain unchanged: instead of the io.Reader being backed by an os.File, the io.Reader is instead backed by a bytes.Reader.

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"
)

// PageHeader is based on pg's PageHeaderData struct
// from src/include/storage/bufpage.h.
type PageHeader struct {
	XLogID            uint32
	XRecOff           uint32
	PdChecksum        uint16
	PdFlags           uint16
	PdLower           uint16
	PdUpper           uint16
	PdSpecial         uint16
	PdPagesizeVersion uint16
	PdPruneXID        uint32
}

type RawItemIDData struct {
	LpOff uint16
	LpLen uint16
}

type ItemIDData struct {
	LpOff   uint16
	LpLen   uint16
	LpFlags byte
}

const PageHeaderByteSize int = 24
const ItemIDByteSize int = 4
const PageSize int = 8192

func main() {
	path := "/home/mwood/16567"

	file, err := os.Open(path)
	if err != nil {
		log.Fatalf("Could not open '%s': %v", path, err)
	}
	defer file.Close() // ignore possible close error

	pageBytes := make([]byte, PageSize)
	n, err := file.Read(pageBytes)
	if n != PageSize {
		log.Fatalf("Only read %d bytes of '%s'; expected to read %d bytes", n, path, PageSize)
	}
	if err != nil && err != io.EOF {
		log.Fatalf("Problem reading database page from '%s': %v", path, err)
	}

	pageReader := bytes.NewReader(pageBytes)

	header := &PageHeader{}
	err = binary.Read(pageReader, binary.LittleEndian, header)
	if err != nil {
		log.Fatalf("Could not read page header: %v", err)
	}

	fmt.Printf("page header: %+v\n", header)

	itemIDDataSize := (int(header.PdLower) - PageHeaderByteSize) / ItemIDByteSize

	fmt.Printf("Number of ItemIDData structs: %d\n", itemIDDataSize)

	itemIDDataPointers := make([]ItemIDData, itemIDDataSize)

	for i := 0; i < itemIDDataSize; i++ {
		raw := &RawItemIDData{}
		err = binary.Read(pageReader, binary.LittleEndian, raw)
		if err != nil {
			log.Fatalf("Could not read %dth ItemIDData: %v", i, err)
		}
		newItem := ItemIDData{
			LpOff: raw.LpOff & 0b_01111111_11111111,
			LpLen: raw.LpLen >> 1,
			LpFlags: byte(((raw.LpOff >> 15) & 0b_00000000_00000001) |
				((raw.LpLen << 1) & 0b_00000000_00000010)),
		}
		itemIDDataPointers[i] = newItem
	}

	for _, p := range itemIDDataPointers {
		fmt.Printf("item ID data pointer: %+v\n", p)
	}
}

Now we can use each ItemIDData.LpOff offset from the start of our data page to read each row. According to the PostgreSQL docs, there is a 23-byte header at the start of each row, defined in the PostgreSQL source code's src/include/access/htup_details.h file. In particular, note how table 68.4 says that t_cid and t_xvac overlay each other!

tuple: (one char width = 1 bit but pretend box drawing chars
use no space; also, "var bytes" is variable number of bytes).
┌───────────────────────┬─────────────────────┬─────────┬─       ─┐
│ RowHeader (23 bytes)  │ field 1 (var bytes) │ field 2 │   ...   │
└───────────────────────┴─────────────────────┴─────────┴─       ─┘

Here is the definition of HeapTupleFields from src/include/access/htup_details.h; HeapTupleFields defines the first 3 fields of HeapTupleHeaderData, also found in src/include/access/htup_details.h, and which corresponds to table 68.4 mentioned above.

typedef struct HeapTupleFields
{
        TransactionId t_xmin;
        TransactionId t_xmax;
        union {
          CommandId       t_cid; // inserting or deleting command ID, or both
          TransactionId t_xvac; // old-style VACUUM FULL xact ID
        } t_field3;
} HeapTupleFields;

I went and printed out all of the hidden columns for our table just so I would know what the correct values were as I played with different byte widths/offsets within the row headers:

select tableoid, xmin, xmax, cmin, cmax, ctid, * from t; rollback;
┌──────────┬──────┬──────┬──────┬──────┬───────┬────┬──────────────┐
│ tableoid │ xmin │ xmax │ cmin │ cmax │ ctid  │ id │     name     │
├──────────┼──────┼──────┼──────┼──────┼───────┼────┼──────────────┤
│    16567 │  552 │    0 │    0 │    0 │ (0,1) │  1 │ Alice        │
│    16567 │  552 │    0 │    0 │    0 │ (0,2) │  2 │ Cheshire Cat │
│    16567 │  552 │    0 │    0 │    0 │ (0,3) │  3 │ Red Queen    │
│    16567 │  552 │    0 │    0 │    0 │ (0,4) │  4 │ White Rabbit │
└──────────┴──────┴──────┴──────┴──────┴───────┴────┴──────────────┘

After much experimenting, I think this is the correct disk layout of a row header:

RowHeader: (one char width = 1 bit but pretend box drawing chars
use no space). There was no room to fit the word "uint16",
but all 16-bit-wide fields are uint16. Hoff is 1 byte.
┌────────────────────────────────┬────────────────────────────────┐
│ Xmin uint32                    │ Xmax uint32                    │
├────────────────────────────────┼────────────────┬───────────────┤
│ CId uint32                     │ IPDBlockIdHi   │ IPDBlockIdLo  │
├────────────────┬───────────────┼────────────────┼────────┬──────┘
│ IPDOffsetNumbr │ NumAttr/flags │ InfoMask u16   │ Hoff   │
└────────────────┴───────────────┴────────────────┴────────┘
                         ↓
                ┌───────────┬─────┐
                │ NumAttrbs │ flg │
                └───────────┴─────┘
Only 11 bits of NumAttr/flags are used to count
the number of attributes; the remaining 5 bits are flags.

Our Go equivalent of the row header looks like this:

type RowHeader struct {
	Xmin uint32
	Xmax uint32
	// in Postgres's HeapTupleFields, t_cid overlaps with t_xvac
	CId uint32
	// Next 3 fields are ItemPointerData,
	// broken down into BlockIdDataHi, BlockIdDataLo, and OffsetNumber
	IPDBlockIdHi    uint16
	IPDBlockIdLo    uint16
	IPDOffsetNumber uint16
	// NumAttribs is also known as InfoMask2 because only 11 bits are used
	// to count the number of attributes, and the remaining 5 bits
	// are attribute bits.
	NumAttribs uint16
	InfoMask   uint16
	Hoff       byte
}

What we do is use each item id data pointer's LpOff (offset from the start of the database page) to slice our pageBytes slice into a smaller slice (pageBytes[p.LpOff:]) that is still a window into the underlying storage of pageBytes. No data are copied! pageBytes[p.LpOff:] is just a restricted view / subslice of the same bytes at the same memory addresses as pageBytes. But what's handy about this approach is that when we hand that subslice to bytes.NewReader(), we read from the beginning of that subslice, even though the subslice itself starts at some offset into pageBytes.

Here is an iteration of our program that uses all of the item id offsets and reads the headers of each row.

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"
)

// PageHeader is based on pg's PageHeaderData struct
// from src/include/storage/bufpage.h.
type PageHeader struct {
	XLogID            uint32
	XRecOff           uint32
	PdChecksum        uint16
	PdFlags           uint16
	PdLower           uint16
	PdUpper           uint16
	PdSpecial         uint16
	PdPagesizeVersion uint16
	PdPruneXID        uint32
}

type RawItemIDData struct {
	LpOff uint16
	LpLen uint16
}

type ItemIDData struct {
	LpOff   uint16
	LpLen   uint16
	LpFlags byte
}

type RowHeader struct {
	Xmin uint32
	Xmax uint32
	// in Postgres's HeapTupleFields, t_cid overlaps with t_xvac
	CId uint32
	// Next 3 fields are ItemPointerData,
	// broken down into BlockIdDataHi, BlockIdDataLo, and OffsetNumber
	IPDBlockIdHi    uint16
	IPDBlockIdLo    uint16
	IPDOffsetNumber uint16
	// NumAttribs is also known as InfoMask2 because only 11 bits are used
	// to count the number of attributes, and the remaining 5 bits
	// are attribute bits.
	NumAttribs uint16
	InfoMask   uint16
	Hoff       byte
}

const PageHeaderByteSize int = 24
const ItemIDByteSize int = 4
const PageSize int = 8192

func main() {
	path := "/home/mwood/16567"

	file, err := os.Open(path)
	if err != nil {
		log.Fatalf("Could not open '%s': %v", path, err)
	}
	defer file.Close() // ignore possible close error

	pageBytes := make([]byte, PageSize)
	n, err := file.Read(pageBytes)
	if n != PageSize {
		log.Fatalf("Only read %d bytes of '%s'; expected to read %d bytes", n, path, PageSize)
	}
	if err != nil && err != io.EOF {
		log.Fatalf("Problem reading database page from '%s': %v", path, err)
	}

	pageReader := bytes.NewReader(pageBytes)

	header := &PageHeader{}
	err = binary.Read(pageReader, binary.LittleEndian, header)
	if err != nil {
		log.Fatalf("Could not read page header: %v", err)
	}

	fmt.Printf("page header: %+v\n", header)

	itemIDDataSize := (int(header.PdLower) - PageHeaderByteSize) / ItemIDByteSize

	fmt.Printf("Number of ItemIDData structs: %d\n", itemIDDataSize)

	itemIDDataPointers := make([]ItemIDData, itemIDDataSize)

	for i := 0; i < itemIDDataSize; i++ {
		raw := &RawItemIDData{}
		err = binary.Read(pageReader, binary.LittleEndian, raw)
		if err != nil {
			log.Fatalf("Could not read %dth ItemIDData: %v", i, err)
		}
		newItem := ItemIDData{
			LpOff: raw.LpOff & 0b_01111111_11111111,
			LpLen: raw.LpLen >> 1,
			LpFlags: byte(((raw.LpOff >> 15) & 0b_00000000_00000001) |
				((raw.LpLen << 1) & 0b_00000000_00000010)),
		}
		itemIDDataPointers[i] = newItem
	}

	for _, p := range itemIDDataPointers {
		fmt.Printf("item ID data pointer: %+v\n", p)
		rowReader := bytes.NewReader(pageBytes[p.LpOff:])
		rowHeader := &RowHeader{}
		err = binary.Read(rowReader, binary.LittleEndian, rowHeader)
		if err != nil {
			log.Fatalf("Could not read row header: %v", err)
		}
		fmt.Printf("row header: %+v\n", rowHeader)
		// According to PostgreSQL's src/include/access/htup_details.h,
		// only the first 11 bits of rowHeader.NumAttribs is used for
		// the number of attributes; the remaining 5 bits are used as flags.
		fmt.Printf("num attributes: %d\n", (rowHeader.NumAttribs & 0x07FF))
	}
}

Prints

page header: &{XLogID:0 XRecOff:24495032 PdChecksum:0 PdFlags:0 PdLower:40 PdUpper:8008 PdSpecial:8192 PdPagesizeVersion:8196 PdPruneXID:0}
Number of ItemIDData structs: 4
item ID data pointer: {LpOff:8152 LpLen:38 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:1 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
item ID data pointer: {LpOff:8104 LpLen:45 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:2 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
item ID data pointer: {LpOff:8056 LpLen:42 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:3 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
item ID data pointer: {LpOff:8008 LpLen:45 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:4 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2

Note the comment I left in the code about rowHeader.NumAttribs & 0x07FF above; only the first 11 bits of NumAttribs are actually used to show the number of columns in a tuple! In src/include/access/htup_details.h, t_infomask2 corresponds to RowHeader.NumAttribs in our program:

/*
 * information stored in t_infomask2:
 */
#define HEAP_NATTS_MASK                 0x07FF  /* 11 bits for number of attributes */
/* bits 0x1800 are available */
#define HEAP_KEYS_UPDATED               0x2000  /* tuple was updated and key cols
                                                                                 * modified, or tuple deleted */
#define HEAP_HOT_UPDATED                0x4000  /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE                 0x8000  /* this is heap-only tuple */

#define HEAP2_XACT_MASK                 0xE000  /* visibility-related bits */

Things continue to get interesting. We actually need to consult pg's system tables to know how to parse the rows themselves! In particular, attlen and attalign.

select attname,
       typname,
       attlen,
       attalign
  from pg_attribute
  join pg_type
    on pg_attribute.atttypid = pg_type.oid
 where attrelid = 16567; rollback;
┌──────────┬─────────┬────────┬──────────┐
│ attname  │ typname │ attlen │ attalign │
├──────────┼─────────┼────────┼──────────┤
│ tableoid │ oid     │      4 │ i        │
│ cmax     │ cid     │      4 │ i        │
│ xmax     │ xid     │      4 │ i        │
│ cmin     │ cid     │      4 │ i        │
│ xmin     │ xid     │      4 │ i        │
│ ctid     │ tid     │      6 │ s        │
│ id       │ int8    │      8 │ d        │ -- <== our id column (bigint)
│ name     │ text    │     -1 │ i        │ -- <== our name column (text)
└──────────┴─────────┴────────┴──────────┘

According to the PostgreSQL docs, attlen 8 for our id column just means the column is 8 bytes wide, and attalign 'd' also means 8 bytes ('d' is for 'double').

Here is how the id field is laid out: It's just a 64-bit int, laid out in little-endian order (presumably because my Pg installation is on an x86).

┌─────────────────────────────────────────────────────────────────┐
│ Id field of tuple (int64)                                       │
└─────────────────────────────────────────────────────────────────┘

For the name column, attlen -1 means a variable-length type (which the text type indeed is), and attalign 'i' means 'int' (4-byte) alignment, which is interesting: there may be padding for our strings!
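As a preview of what parsing the name column will involve: for short strings, PostgreSQL uses a 1-byte varlena header. My reading of src/include/postgres.h is that, on a little-endian machine, a set low bit in the first byte marks a 1-byte header, and the remaining 7 bits give the total length including that header byte. This is a hedged sketch (the function name is mine, and it ignores 4-byte headers and TOASTed values entirely):

```go
package main

import "fmt"

// decodeShortVarlena decodes a PostgreSQL short varlena as I believe
// it is laid out on disk on a little-endian machine: low bit of the
// first byte set means "1-byte header", and the other 7 bits are the
// total length in bytes, including the header byte itself.
func decodeShortVarlena(b []byte) (string, bool) {
	if len(b) == 0 || b[0]&0x01 != 0x01 {
		return "", false // not a 1-byte-header varlena
	}
	total := int(b[0] >> 1) // length including the header byte
	if total < 1 || total > len(b) {
		return "", false
	}
	return string(b[1:total]), true
}

func main() {
	// 'Alice' is 5 data bytes, so the total is 6 and the header
	// byte is (6 << 1) | 1 = 13.
	raw := []byte{13, 'A', 'l', 'i', 'c', 'e'}
	s, ok := decodeShortVarlena(raw)
	fmt.Println(ok, s)
}
```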

Normally, after the row header one would find a bitmap recording which columns are null, but for rows with no null columns (as ours have) this bitmap does not exist. Also, our table does not use oids (tables without oids have been the PostgreSQL default for a long time now), so that post-header field does not exist either. Regardless, rowHeader.Hoff is the "header offset": the number of bytes from the start of the header that we have to skip until we reach the first attribute, which, in our case, is the 'id' column of our table.

Here's the next iteration of our program, which can print the id column of our table!

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"
)

// PageHeader is based on pg's PageHeaderData struct
// from src/include/storage/bufpage.h.
type PageHeader struct {
	XLogID            uint32
	XRecOff           uint32
	PdChecksum        uint16
	PdFlags           uint16
	PdLower           uint16
	PdUpper           uint16
	PdSpecial         uint16
	PdPagesizeVersion uint16
	PdPruneXID        uint32
}

type RawItemIDData struct {
	LpOff uint16
	LpLen uint16
}

type ItemIDData struct {
	LpOff   uint16
	LpLen   uint16
	LpFlags byte
}

type RowHeader struct {
	Xmin uint32
	Xmax uint32
	// in Postgres's HeapTupleFields, t_cid overlaps with t_xvac
	CId uint32
	// Next 3 fields are ItemPointerData,
	// broken down into BlockIdDataHi, BlockIdDataLo, and OffsetNumber
	IPDBlockIdHi    uint16
	IPDBlockIdLo    uint16
	IPDOffsetNumber uint16
	// NumAttribs is also known as InfoMask2 because only 11 bits are used
	// to count the number of attributes, and the remaining 5 bits
	// are attribute bits.
	NumAttribs uint16
	InfoMask   uint16
	Hoff       byte
}

const PageHeaderByteSize int = 24
const ItemIDByteSize int = 4
const PageSize int = 8192

func main() {
	path := "/home/mwood/16567"

	file, err := os.Open(path)
	if err != nil {
		log.Fatalf("Could not open '%s': %v", path, err)
	}
	defer file.Close() // ignore possible close error

	pageBytes := make([]byte, PageSize)
	n, err := file.Read(pageBytes)
	if n != PageSize {
		log.Fatalf("Only read %d bytes of '%s'; expected to read %d bytes", n, path, PageSize)
	}
	if err != nil && err != io.EOF {
		log.Fatalf("Problem reading database page from '%s': %v", path, err)
	}

	pageReader := bytes.NewReader(pageBytes)

	header := &PageHeader{}
	err = binary.Read(pageReader, binary.LittleEndian, header)
	if err != nil {
		log.Fatalf("Could not read page header: %v", err)
	}

	fmt.Printf("page header: %+v\n", header)

	itemIDDataSize := (int(header.PdLower) - PageHeaderByteSize) / ItemIDByteSize

	fmt.Printf("Number of ItemIDData structs: %d\n", itemIDDataSize)

	itemIDDataPointers := make([]ItemIDData, itemIDDataSize)

	for i := 0; i < itemIDDataSize; i++ {
		raw := &RawItemIDData{}
		err = binary.Read(pageReader, binary.LittleEndian, raw)
		if err != nil {
			log.Fatalf("Could not read %dth ItemIDData: %v", i, err)
		}
		newItem := ItemIDData{
			LpOff: raw.LpOff & 0b_01111111_11111111,
			LpLen: raw.LpLen >> 1,
			LpFlags: byte(((raw.LpOff >> 15) & 0b_00000000_00000001) |
				((raw.LpLen << 1) & 0b_00000000_00000010)),
		}
		itemIDDataPointers[i] = newItem
	}

	for _, p := range itemIDDataPointers {
		fmt.Printf("item ID data pointer: %+v\n", p)
		rowReader := bytes.NewReader(pageBytes[p.LpOff:])
		rowHeader := &RowHeader{}
		err = binary.Read(rowReader, binary.LittleEndian, rowHeader)
		if err != nil {
			log.Fatalf("Could not read row header: %v", err)
		}
		fmt.Printf("row header: %+v\n", rowHeader)
		// According to PostgreSQL's src/include/access/htup_details.h,
	// only the low 11 bits of rowHeader.NumAttribs are used for
		// the number of attributes; the remaining 5 bits are used as flags.
		fmt.Printf("num attributes: %d\n", (rowHeader.NumAttribs & 0x07FF))

		// rowHeader.Hoff is the "header offset", which is the number of
		// bytes from the start of the header that we have to read until
		// we reach the first attribute.
		_, err = rowReader.Seek(int64(rowHeader.Hoff), io.SeekStart)
		if err != nil {
			log.Fatalf("Could not seek offset %d: %v", rowHeader.Hoff, err)
		}

		var id int64
		err = binary.Read(rowReader, binary.LittleEndian, &id)
		if err != nil {
			log.Fatalf("Could not read id column: %v", err)
		}
		fmt.Printf("id: %d\n", id)
	}
}

Prints:

page header: &{XLogID:0 XRecOff:24495032 PdChecksum:0 PdFlags:0 PdLower:40 PdUpper:8008 PdSpecial:8192 PdPagesizeVersion:8196 PdPruneXID:0}
Number of ItemIDData structs: 4
item ID data pointer: {LpOff:8152 LpLen:38 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:1 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
id: 1
item ID data pointer: {LpOff:8104 LpLen:45 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:2 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
id: 2
item ID data pointer: {LpOff:8056 LpLen:42 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:3 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
id: 3
item ID data pointer: {LpOff:8008 LpLen:45 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:4 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
id: 4

Now let's see if we can print out the name column. A lot of Googling surfaced a few interesting things. First, the in-memory and on-disk representations of the text data types differ. Text types are stored as varlenas: variable-length arrays. Varlenas are stored with the byte length first, and then all of the bytes next. The length includes the number of bytes required to encode the length itself. Long varlenas have a 4-byte length followed by all of the bytes of actual data. Short varlenas have a 1-byte length followed by all of the bytes of the actual data. This, of course, means we need a way of determining whether we are dealing with a short or long varlena. It took a lot of searching to find this gem in src/include/postgres.h:

/*
 * Bit layouts for varlena headers on big-endian machines:
 *
 * 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
 * 01xxxxxx 4-byte length word, aligned, *compressed* data (up to 1G)
 * 10000000 1-byte length word, unaligned, TOAST pointer
 * 1xxxxxxx 1-byte length word, unaligned, uncompressed data (up to 126b)
 *
 * Bit layouts for varlena headers on little-endian machines:
 *
 * xxxxxx00 4-byte length word, aligned, uncompressed data (up to 1G)
 * xxxxxx10 4-byte length word, aligned, *compressed* data (up to 1G)
 * 00000001 1-byte length word, unaligned, TOAST pointer
 * xxxxxxx1 1-byte length word, unaligned, uncompressed data (up to 126b)
 *
 * The "xxx" bits are the length field (which includes itself in all cases).
 * In the big-endian case we mask to extract the length, in the little-endian
 * case we shift.  Note that in both cases the flag bits are in the physically
 * first byte.  Also, it is not possible for a 1-byte length word to be zero;
 * this lets us disambiguate alignment padding bytes from the start of an
 * unaligned datum.  (We now *require* pad bytes to be filled with zero!)
 *
 * In TOAST pointers the va_tag field (see varattrib_1b_e) is used to discern
 * the specific type and length of the pointer datum.
 */

Because I know my sample data only have short varlenas, I'm just going to parse those, as a cheat, because this learning exercise has gone a bit long. (Question for later: does an empty string (not a null string, an empty string) take four bytes of storage, because it is not possible for a 1-byte length word to be zero?)

Here is how the name field is laid out: it's a short varlena, with one byte storing the length, followed by one byte per character (because every character in our example names fits in a single UTF-8 byte).

Name field of tuple (short varlena where the length field is 6, counting the length byte itself, and 5 chars follow):
┌────────┬────────┬────────┬────────┬────────┬────────┐
│ len    │  'A'   │  'l'   │  'i'   │  'c'   │  'e'   │
└────────┴────────┴────────┴────────┴────────┴────────┘

Here is the final iteration of the program. It parses out the varlena length and prints the name field.

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"
)

// PageHeader is based on pg's PageHeaderData struct
// from src/include/storage/bufpage.h.
type PageHeader struct {
	XLogID            uint32
	XRecOff           uint32
	PdChecksum        uint16
	PdFlags           uint16
	PdLower           uint16
	PdUpper           uint16
	PdSpecial         uint16
	PdPagesizeVersion uint16
	PdPruneXID        uint32
}

type RawItemIDData struct {
	LpOff uint16
	LpLen uint16
}

type ItemIDData struct {
	LpOff   uint16
	LpLen   uint16
	LpFlags byte
}

type RowHeader struct {
	Xmin uint32
	Xmax uint32
	// in Postgres's HeapTupleFields, t_cid overlaps with t_xvac
	CId uint32
	// Next 3 fields are ItemPointerData,
	// broken down into BlockIdDataHi, BlockIdDataLo, and OffsetNumber
	IPDBlockIdHi    uint16
	IPDBlockIdLo    uint16
	IPDOffsetNumber uint16
	// NumAttribs is also known as InfoMask2 because only 11 bits are used
	// to count the number of attributes, and the remaining 5 bits
	// are flag bits.
	NumAttribs uint16
	InfoMask   uint16
	Hoff       byte
}

const PageHeaderByteSize int = 24
const ItemIDByteSize int = 4
const PageSize int = 8192

func main() {
	path := "/home/mwood/16567"

	file, err := os.Open(path)
	if err != nil {
		log.Fatalf("Could not open '%s': %v", path, err)
	}
	defer file.Close() // ignore possible close error

	pageBytes := make([]byte, PageSize)
	n, err := file.Read(pageBytes)
	if n != PageSize {
		log.Fatalf("Only read %d bytes of '%s'; expected to read %d bytes", n, path, PageSize)
	}
	if err != nil && err != io.EOF {
		log.Fatalf("Problem reading database page from '%s': %v", path, err)
	}

	pageReader := bytes.NewReader(pageBytes)

	header := &PageHeader{}
	err = binary.Read(pageReader, binary.LittleEndian, header)
	if err != nil {
		log.Fatalf("Could not read page header: %v", err)
	}

	fmt.Printf("page header: %+v\n", header)

	itemIDDataSize := (int(header.PdLower) - PageHeaderByteSize) / ItemIDByteSize

	fmt.Printf("Number of ItemIDData structs: %d\n", itemIDDataSize)

	itemIDDataPointers := make([]ItemIDData, itemIDDataSize)

	for i := 0; i < itemIDDataSize; i++ {
		raw := &RawItemIDData{}
		err = binary.Read(pageReader, binary.LittleEndian, raw)
		if err != nil {
			log.Fatalf("Could not read %dth ItemIDData: %v", i, err)
		}
		newItem := ItemIDData{
			LpOff: raw.LpOff & 0b_01111111_11111111,
			LpLen: raw.LpLen >> 1,
			LpFlags: byte(((raw.LpOff >> 15) & 0b_00000000_00000001) |
				((raw.LpLen << 1) & 0b_00000000_00000010)),
		}
		itemIDDataPointers[i] = newItem
	}

	for _, p := range itemIDDataPointers {
		fmt.Printf("item ID data pointer: %+v\n", p)
		rowReader := bytes.NewReader(pageBytes[p.LpOff:])
		rowHeader := &RowHeader{}
		err = binary.Read(rowReader, binary.LittleEndian, rowHeader)
		if err != nil {
			log.Fatalf("Could not read row header: %v", err)
		}
		fmt.Printf("row header: %+v\n", rowHeader)
		// According to PostgreSQL's src/include/access/htup_details.h,
	// only the low 11 bits of rowHeader.NumAttribs are used for
		// the number of attributes; the remaining 5 bits are used as flags.
		fmt.Printf("num attributes: %d\n", (rowHeader.NumAttribs & 0x07FF))

		// rowHeader.Hoff is the "header offset", which is the number of
		// bytes from the start of the header that we have to read until
		// we reach the first attribute.
		_, err = rowReader.Seek(int64(rowHeader.Hoff), io.SeekStart)
		if err != nil {
			log.Fatalf("Could not seek offset %d: %v", rowHeader.Hoff, err)
		}

		var id int64
		err = binary.Read(rowReader, binary.LittleEndian, &id)
		if err != nil {
			log.Fatalf("Could not read id column: %v", err)
		}
		fmt.Printf("id: %d\n", id)

		// Hack: all of our strings are short, so read the next byte
		// and use it as a one-byte length for a PostgreSQL varlena.
		aLen, err := rowReader.ReadByte()
		if err != nil {
			log.Fatalf("Could not read first byte (varlena length) of name column: %v", err)
		}
		if aLen&0b00000001 != 1 {
			log.Fatalf("Last bit of varlena length needs to be 1, not 0.")
		}
		// According to src/include/postgres.h, the first bit of the length
		// of a short varlena is just a flag bit, and should be shifted away
		// before treating the remaining bits as the length.
		aLen = aLen >> 1
		// The length includes the byte required to encode the length, but
		// we only need the number of bytes we need to read after the length.
		aLen = aLen - 1
		text := make([]byte, aLen)
		n, err := rowReader.Read(text)
		if err != nil {
			log.Fatalf("Could not read name column: %v", err)
		}
		if n != int(aLen) {
			log.Fatalf("Could not read %d bytes of name column: %v", aLen, err)
		}
		fmt.Printf("name: %s\n", text)
	}
}

Prints:

page header: &{XLogID:0 XRecOff:24495032 PdChecksum:0 PdFlags:0 PdLower:40 PdUpper:8008 PdSpecial:8192 PdPagesizeVersion:8196 PdPruneXID:0}
Number of ItemIDData structs: 4
item ID data pointer: {LpOff:8152 LpLen:38 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:1 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
id: 1
name: Alice
item ID data pointer: {LpOff:8104 LpLen:45 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:2 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
id: 2
name: Cheshire Cat
item ID data pointer: {LpOff:8056 LpLen:42 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:3 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
id: 3
name: Red Queen
item ID data pointer: {LpOff:8008 LpLen:45 LpFlags:1}
row header: &{Xmin:552 Xmax:0 CId:0 IPDBlockIdHi:0 IPDBlockIdLo:0 IPDOffsetNumber:4 NumAttribs:2 InfoMask:2050 Hoff:24}
num attributes: 2
id: 4
name: White Rabbit