# Binary Data (Blobs)

Blobs provide efficient binary data storage with typed layouts and zero-copy NumPy integration.

**When to use**: Use blobs for binary data like images, meshes, and raw buffers. `BlobArray` for typed arrays, `BlobPack` for structured multi-region data.

---

## Blob vs BlobId

Viper offers two DSM types for binary data:

| Type      | Use Case                       | Storage            |
|-----------|--------------------------------|--------------------|
| `blob`    | Small data (thumbnails, icons) | Inline in document |
| `blob_id` | Large data (textures, meshes)  | Database blob API  |

### blob (Inline)

A `blob` field stores binary data directly in the document. No special API needed.

### blob_id (Reference)

A `blob_id` is a SHA-1 hash referencing a blob managed by the Database blob API. This requires the dedicated blob API:

```{doctest}
>>> layout = BlobLayout()
>>> content = ValueBlob(bytes([1, 2, 3, 4, 5]))
>>> blob_id = db.create_blob(layout, content)
>>> blob_id
f07d73c81ed1a91165a75c5cc22253cc7895a8b2
>>> db.blob(blob_id)
blob(5)
>>> blob_id in db.blob_ids()
True
>>> info = db.blob_info(blob_id)
>>> info.size()
5
```

**Content-addressable**: The `blob_id` is computed from layout + content (SHA-1). Identical content always produces the same `blob_id`.

**Constraint**: A document referencing a `blob_id` cannot be committed unless the blob exists in the database. Always call `create_blob()` before using the `blob_id` in a document.

## ValueBlob

A `ValueBlob` holds inline binary data. To pull the bytes back out, use the `bytes()` builtin (the buffer protocol) — there is no `.bytes()` method:

```{doctest}
>>> data = bytes([1, 2, 3, 4, 5])
>>> blob = ValueBlob(data)
>>> bytes(blob)
b'\x01\x02\x03\x04\x05'
>>> len(blob)
5
```

## ValueBlobId

A `ValueBlobId` references external binary data.
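Because the id is just a SHA-1 over the layout and payload, the content-addressing property is easy to sketch in plain Python. The exact byte encoding Viper hashes is internal; the layout-string-then-payload scheme below is an assumption made purely for illustration:

```python
import hashlib

def illustrative_blob_id(layout: str, content: bytes) -> str:
    # Hypothetical scheme: hash the layout string, then the payload.
    # The real blob_id computation is internal to the database; only
    # the property matters: same (layout, content) -> same id.
    h = hashlib.sha1()
    h.update(layout.encode("utf-8"))
    h.update(content)
    return h.hexdigest()

a = illustrative_blob_id("uchar-1", bytes([1, 2, 3, 4]))
b = illustrative_blob_id("uchar-1", bytes([1, 2, 3, 4]))
c = illustrative_blob_id("float-1", bytes([1, 2, 3, 4]))

assert a == b        # identical inputs: identical id
assert a != c        # different layout: different id
assert len(a) == 40  # a 40-char SHA-1 hex digest, like a real blob_id
```

This determinism is what makes `blob_id` safe to recompute on any machine and to use for deduplication.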
The id is computed from the layout and content (deterministic SHA-1):

```{doctest}
>>> layout = BlobLayout()
>>> content = ValueBlob(bytes([1, 2, 3, 4]))
>>> blob_id = ValueBlobId(layout, content)
>>> blob_id
6b8f3ca756046be29244d9bdb6b5ca5c00468ad5
>>> ValueBlobId.try_parse("6b8f3ca756046be29244d9bdb6b5ca5c00468ad5")
6b8f3ca756046be29244d9bdb6b5ca5c00468ad5
```

## BlobLayout - Metadata Everywhere

A `BlobLayout` describes how to interpret blob bytes. This is the **Metadata Everywhere** principle applied to binary data: the layout is metadata that gives meaning to raw bytes.

```{doctest}
>>> BlobLayout()
'uchar-1'
>>> BlobLayout('float', 3)
'float-3'
>>> BlobLayout('uint', 3)
'uint-3'
>>> BlobLayout('float', 2)
'float-2'
```

The layout enables:

- Type-safe interpretation of binary data
- Cross-platform compatibility (endianness handled)
- Validation at decode time

## BlobView (Read-Only)

A `BlobView` interprets an existing blob with a given layout (read-only). When the layout has more than one component per element, indexing returns a tuple:

```{doctest}
>>> import struct
>>> _buf = bytearray()
>>> for i in range(100):
...     _buf.extend(struct.pack('<fff', i, i * 2, i * 3))
>>> raw = ValueBlob(bytes(_buf))
>>> view = BlobView(BlobLayout('float', 3), raw)
>>> view.count()
100
>>> view[0]
(0.0, 0.0, 0.0)
>>> view[99]
(99.0, 198.0, 297.0)
```

Use `BlobView` when you need to read blob data without copying.

## BlobArray (Read-Write)

A `BlobArray` is a typed array backed by a blob.
Writes and reads use the NumPy buffer protocol — the array exposes a *flat* `(N*components,)` view, so reshape it to `(N, components)` to assign per-element tuples:

```{doctest}
>>> import numpy as np
>>> layout = BlobLayout('float', 3)
>>> array = BlobArray(layout, 100)
>>> np_view = np.array(array, copy=False).reshape(100, 3)
>>> np_view[0] = [1.0, 2.0, 3.0]
>>> np_view[1] = [4.0, 5.0, 6.0]
>>> view = BlobView(layout, array.blob())
>>> view[0]
(1.0, 2.0, 3.0)
>>> view[1]
(4.0, 5.0, 6.0)
```

## BlobPack - Structured Binary Data

A `BlobPack` groups multiple named regions with different layouts into a single blob. This is ideal for complex structures like 3D meshes.

### Example: 3D Mesh Storage

A mesh has positions, normals, UVs, and triangle indices — each with a different layout. Define the structure with a descriptor, then create the pack:

```{doctest}
>>> descriptor = BlobPackDescriptor()
>>> descriptor.add_region('positions', BlobLayout('float', 3), 4)
>>> descriptor.add_region('normals', BlobLayout('float', 3), 4)
>>> descriptor.add_region('uvs', BlobLayout('float', 2), 4)
>>> descriptor.add_region('indices', BlobLayout('uint', 3), 2)
>>> mesh = BlobPack(descriptor)
>>> len(mesh)
4
```

### Fill the Mesh Data

Each region exposes a flat NumPy view; reshape to write per-vertex tuples:

```{doctest}
>>> import numpy as np
>>> pos = np.array(mesh['positions'], copy=False).reshape(4, 3)
>>> pos[:] = [[-1.0, -1.0, 0.0], [1.0, -1.0, 0.0],
...           [ 1.0,  1.0, 0.0], [-1.0,  1.0, 0.0]]
>>> normals = np.array(mesh['normals'], copy=False).reshape(4, 3)
>>> normals[:] = [[0.0, 0.0, 1.0]] * 4
>>> uvs = np.array(mesh['uvs'], copy=False).reshape(4, 2)
>>> uvs[:] = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
>>> indices = np.array(mesh['indices'], copy=False).reshape(2, 3)
>>> indices[:] = [[0, 1, 2], [0, 2, 3]]
```

### Serialize and Restore

```{doctest}
>>> blob = mesh.blob()
>>> restored = BlobPack.from_blob(blob)
>>> np.array(restored['positions'], copy=False).reshape(4, 3)[0].tolist()
[-1.0, -1.0, 0.0]
>>> np.array(restored['indices'], copy=False).reshape(2, 3)[1].tolist()
[0, 2, 3]
```

### Region Access

```{doctest}
>>> 'positions' in mesh
True
>>> 'colors' in mesh
False
>>> mesh['positions'].name()
'positions'
>>> mesh['positions'].count()
4
>>> mesh['positions'].blob_layout()
'float-3'
>>> mesh['positions'].data_count()
12
>>> mesh['positions'].byte_count()
48
```

### Why BlobPack?

| Benefit             | Description                        |
|---------------------|------------------------------------|
| **Single blob**     | All mesh data in one blob_id       |
| **Typed regions**   | Each region has its own layout     |
| **Self-describing** | Layout metadata embedded in header |
| **Efficient**       | Direct memory mapping, no parsing  |

## NumPy Integration

`BlobArray` implements the Python Buffer Protocol, enabling zero-copy interoperability with NumPy and other array libraries.
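The mechanics are easy to demonstrate even without a `BlobArray`: a plain `bytearray` also implements the buffer protocol, so it can stand in for one in a self-contained sketch:

```python
import numpy as np

# A bytearray stands in for a BlobArray here: both expose the buffer
# protocol, which is what lets NumPy wrap the memory without copying.
buf = bytearray(3 * 4)                       # raw storage for 3 float32s
view = np.frombuffer(buf, dtype=np.float32)  # zero-copy: shares buf's memory

view[:] = [1.0, 2.0, 3.0]                    # writes go straight into buf

# A second, independent wrap of the same bytes sees the new values,
# proving the write mutated the shared memory rather than a copy.
assert np.frombuffer(buf, dtype=np.float32).tolist() == [1.0, 2.0, 3.0]

# An explicit copy detaches from the buffer: mutations no longer propagate.
copied = np.array(view, copy=True)
copied[0] = 99.0
assert view[0] == 1.0
```

The same sharing/detaching distinction applies below: `np.array(..., copy=False)` keeps the blob's memory, while `copy=True` snapshots it.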
### Zero-Copy View

The buffer is exposed as a flat 1D array of components — reshape to give it the geometric shape you want:

```{doctest}
>>> import numpy as np
>>> layout = BlobLayout('float', 3)
>>> positions = BlobArray(layout, 100)
>>> np_view = np.array(positions, copy=False)
>>> np_view.shape
(300,)
>>> np_view.dtype
dtype('float32')
>>> np_view.reshape(100, 3).shape
(100, 3)
```

### Bidirectional Modifications

Changes through NumPy affect the original BlobArray:

```{doctest}
>>> reshaped = np.array(positions, copy=False).reshape(100, 3)
>>> reshaped[0] = [10.0, 20.0, 30.0]
>>> view = BlobView(layout, positions.blob())
>>> view[0]
(10.0, 20.0, 30.0)
```

### Direct Memory Access

```{doctest}
>>> mv = memoryview(positions)
>>> mv.nbytes
1200
```

### BlobPack Regions

`BlobPackRegion` also supports the Buffer Protocol — same flat-then-reshape idiom:

```{doctest}
>>> positions_np = np.array(mesh['positions'], copy=False).reshape(4, 3)
>>> normals_np = np.array(mesh['normals'], copy=False).reshape(4, 3)
>>> uvs_np = np.array(mesh['uvs'], copy=False).reshape(4, 2)
>>> positions_np.shape
(4, 3)
>>> uvs_np.shape
(4, 2)
>>> positions_np *= 2.0
>>> positions_np[0].tolist()
[-2.0, -2.0, 0.0]
```

### Why Zero-Copy Matters

| Scenario           | Without Zero-Copy   | With Zero-Copy |
|--------------------|---------------------|----------------|
| 1M vertices        | Copy 12MB           | Share pointer  |
| GPU upload         | Python → copy → C++ | Direct access  |
| Scientific compute | Data duplication    | In-place ops   |

## Blob in Attachments

Blobs can be used in attachments:

```dsm
// Small data inline
struct Thumbnail {
    uint16 width;
    uint16 height;
    blob data;      // Inline binary
};

// Large data by reference
struct Texture {
    uint16 width;
    uint16 height;
    blob_id pixels; // External reference
};
```

## Storing Blobs in Database

### Inline Blobs

Inline blobs are stored directly in the document.
The Tuto fixture exposes a `Thumbnail` struct with a `blob` field, attached to `User` as `avatar`:

```{doctest}
>>> user_key = TUTO_A_USER_AVATAR.create_key()
>>> thumb = TUTO_A_USER_AVATAR.create_document()
>>> thumb.width = 64
>>> thumb.height = 64
>>> thumb.data = ValueBlob(bytes([1, 2, 3, 4, 5]))
>>> thumb
{width=64, height=64, data=blob(5)}
>>> ms = CommitMutableState(db.initial_state())
>>> ms.attachment_mutating().set(TUTO_A_USER_AVATAR, user_key, thumb)
>>> avatar_commit = db.commit_mutations("Add avatar", ms)
>>> db.state(avatar_commit).attachment_getting().get(TUTO_A_USER_AVATAR, user_key)
Optional({width=64, height=64, data=blob(5)})
```

### Referenced Blobs (blob_id)

Use the database blob API to store and retrieve. The Tuto fixture exposes a `Texture` struct with a `blob_id` field, attached to `User` as `portrait`:

```{doctest}
>>> texture_layout = BlobLayout()
>>> texture_content = ValueBlob(bytes([10, 20, 30, 40]))
>>> texture_blob_id = db.create_blob(texture_layout, texture_content)
>>> texture = TUTO_A_USER_PORTRAIT.create_document()
>>> texture.width = 1024
>>> texture.height = 1024
>>> texture.pixels = texture_blob_id
>>> ms = CommitMutableState(db.state(avatar_commit))
>>> ms.attachment_mutating().set(TUTO_A_USER_PORTRAIT, user_key, texture)
>>> portrait_commit = db.commit_mutations("Add portrait", ms)
>>> db.state(portrait_commit).attachment_getting().get(TUTO_A_USER_PORTRAIT, user_key)
Optional({width=1024, height=1024, pixels=...})
```

### Retrieving Blobs

```{doctest}
>>> stored_id = db.create_blob(BlobLayout(), ValueBlob(bytes([10, 20, 30, 40, 50])))
>>> bytes(db.blob(stored_id))
b'\n\x14\x1e(2'
>>> db.read_blob(stored_id, size=2, offset=0)
blob(2)
```

### BlobStream (Large Blobs)

For very large blobs, use streaming to avoid loading everything in memory.

**Required for blobs > 2GB**: The standard `create_blob()` API has a 2GB size limit.
Use BlobStream for larger data:

```{doctest}
>>> stream = db.blob_stream_create(BlobLayout('uchar', 1), size=10)
>>> db.blob_stream_append(stream, ValueBlob(bytes([1, 2, 3, 4, 5])))
>>> db.blob_stream_append(stream, ValueBlob(bytes([6, 7, 8, 9, 10])))
>>> stream_blob_id = db.blob_stream_close(stream)
>>> bytes(db.blob(stream_blob_id))
b'\x01\x02\x03\x04\x05\x06\x07\x08\t\n'
```

This is essential for:

- 3D meshes with millions of vertices
- Video/audio data

## When to Use Each Type

| Scenario            | Recommendation                    |
|---------------------|-----------------------------------|
| Thumbnails (< 64KB) | Use `blob` (inline)               |
| Textures (> 1MB)    | Use `blob_id` (Database blob API) |
| Mesh geometry       | Use `blob_id` (Database blob API) |
| Icons, small images | Use `blob` (inline)               |
| Audio/video         | Use `blob_id` (Database blob API) |

## What's Next

- [Serialization](serialization.md) - JSON and binary encoding