Iceberg extension
The iceberg extension adds support for scanning and copying from the Apache Iceberg format.
Iceberg is an open-source table format originally developed at Netflix for large-scale analytical datasets.
Usage
Please see Install an extension and Load an extension first before getting started.
Example dataset
Download iceberg_tables.zip and unzip
it to the /tmp directory.
cd /tmpwget https://lbugdb.github.io/data/iceberg-extension/iceberg_tables.zipunzip iceberg_tables.zipScan Iceberg tables
LOAD FROM is a Cypher query that scans a file or object element by element, but doesn’t actually
copy the data into a Ladybug table.
LOAD FROM '/tmp/iceberg_tables/university' (file_format='iceberg', allow_moved_paths=true)RETURN *;┌────────────┬──────┬──────────┐│ University │ Rank │ Funding │├────────────┼──────┼──────────┤│ Stanford │ 2 │ 250.300 ││ Yale │ 6 │ 190.700 ││ Harvard │ 1 │ 210.500 ││ Cambridge │ 5 │ 280.200 ││ MIT │ 3 │ 170.000 ││ Oxford │ 4 │ 300.000 │└────────────┴──────┴──────────┘Copy Iceberg tables into Ladybug
You can use a COPY FROM statement to copy the contents of an Iceberg table into Ladybug.
CREATE NODE TABLE university (name STRING PRIMARY KEY, age INT64);COPY university FROM '/tmp/iceberg_tables/person_table' (file_format='iceberg', allow_moved_paths=true);┌─────────────────────────────────────────────────────┐│ result ││ STRING │├─────────────────────────────────────────────────────┤│ 6 tuples have been copied to the university table. │└─────────────────────────────────────────────────────┘Access Iceberg metadata
At the heart of Iceberg’s table structure is the metadata, which tracks everything from the schema, to partition information, and snapshots of the table’s state.
The ICEBERG_METADATA function lists the metadata files for an Iceberg table.
CALL ICEBERG_METADATA( '/tmp/iceberg_tables/lineitem_iceberg', allow_moved_paths := true)RETURN *;┌──────────────────────────┬──────────────────────────┬──────────────────┬─────────┬──────────┬──────────────────────────┬─────────────┬──────────────┐│ manifest_path │ manifest_sequence_number │ manifest_content │ status │ content │ file_path │ file_format │ record_count ││ STRING │ INT64 │ STRING │ STRING │ STRING │ STRING │ STRING │ INT64 │├──────────────────────────┼──────────────────────────┼──────────────────┼─────────┼──────────┼──────────────────────────┼─────────────┼──────────────┤│ lineitem_iceberg/meta... │ 2 │ DATA │ ADDED │ EXISTING │ lineitem_iceberg/data... │ PARQUET │ 51793 ││ lineitem_iceberg/meta... │ 2 │ DATA │ DELETED │ EXISTING │ lineitem_iceberg/data... │ PARQUET │ 60175 │└──────────────────────────┴──────────────────────────┴──────────────────┴─────────┴──────────┴──────────────────────────┴─────────────┴──────────────┘List Iceberg snapshots
Iceberg tables maintain a series of snapshots, which are consistent views of the table at a specific point in time. Snapshots are the core of Iceberg’s versioning system, allowing you to track, query, and manage changes to your table over time.
The ICEBERG_SNAPSHOTS function lists the snapshots for an Iceberg table.
Note that for snapshots, you do not need to specify the allow_moved_paths option.
CALL ICEBERG_SNAPSHOTS('/tmp/iceberg_tables/lineitem_iceberg') RETURN *;┌─────────────────┬─────────────────────┬─────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────┐│ sequence_number │ snapshot_id │ timestamp_ms │ manifest_list ││ UINT64 │ UINT64 │ TIMESTAMP │ STRING │├─────────────────┼─────────────────────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────┤│ 1 │ 3776207205136740581 │ 2023-02-15 15:07:54.504 │ lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro ││ 2 │ 7635660646343998149 │ 2023-02-15 15:08:14.73 │ lineitem_iceberg/metadata/snap-7635660646343998149-1-10eaca8a-1e1c-421e-ad6d-b232e5ee23d3.avro │└─────────────────┴─────────────────────┴─────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────┘Access Iceberg tables hosted on S3
Ladybug also supports scanning and copying Iceberg tables hosted on S3.
Configure the S3 connection
Before reading and writing from S3, you have to configure the connection using a CALL statement.
CALL <option_name>='<option_value>'| Option name | Description |
|---|---|
s3_access_key_id | S3 access key id |
s3_secret_access_key | S3 secret access key |
s3_endpoint | S3 endpoint |
s3_region | S3 region |
s3_url_style | Uses S3 url style (should either be vhost or path) |
Requirements on the S3 server APIs
| Feature | Required S3 API features |
|---|---|
| Public file reads | HTTP Range request |
| Private file reads | Secret key authentication |
Scan Iceberg tables from S3
LOAD FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_moved_paths=true)RETURN *;Copy Iceberg tables from S3 into Ladybug
CREATE NODE TABLE student (ID INT64 PRIMARY KEY, name STRING);COPY student FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_moved_paths=true);Optional parameters
The following optional parameters are supported when using the functions from the iceberg extension.
allow_moved_paths
- Type:
BOOLEAN - Default:
false
Allows scanning Iceberg tables that are not located in their original directory.
metadata_compression_codec
- Type:
STRING - Allowed values:
gzip - Default:
''
By default, this extension will look for v{version}.metadata.json and {version}.metadata.json files for metadata.
When metadata_compression_codec = 'gzip' is specified, it will look for v{version}.gz.metadata.json and {version}.gz.metadata.json files instead.
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_gz' ( file_format='iceberg', allow_moved_paths=true, metadata_compression_codec = 'gzip')RETURN *;version
- Type:
STRING - Default: determined from
version-hint.txt
You can specify an explicit Iceberg metadata version:
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg' ( file_format='iceberg', allow_moved_paths=true, version='2')RETURN *;version_name_format
- Type:
STRING - Default:
'v%s%s.metadata.json,%s%s.metadata.json'
You can specify a custom metadata file name format.
For example, if your metadata is named as rev-2.metadata.json:
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_alter_name' ( file_format='iceberg', allow_moved_paths=true, version_name_format = 'rev-%s.metadata.json')RETURN *;Limitations
Currently, the iceberg extension does not support:
- Exporting to Iceberg tables from Ladybug is not supported.
- Scanning/copying nested data (i.e., of type
STRUCT) in Iceberg table columns is not supported.