<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

# Variant Binary Encoding

This directory contains binary artifacts encoded using the Parquet [Variant]
binary encoding. These files are **not** valid Parquet files, but rather
raw binary data. 

## Structure

* `data_dictionary.json` - contains the JSON representation for each example

Each example consists of 2 files:

* `.metadata` -- the binary contents of the `metadata` field
* `.value` -- the binary contents of the `value` field
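
As a minimal sketch of how a test might read one example pair, the helper below loads both files as raw bytes. The function name `load_example` and its arguments are illustrative assumptions, not part of this repository.

```python
from pathlib import Path


def load_example(directory, name):
    """Return (metadata, value) raw bytes for one example, e.g. 'primitive_int8'.

    `load_example` is a hypothetical helper; only the `.metadata` and
    `.value` file suffixes come from this README.
    """
    base = Path(directory)
    metadata = (base / f"{name}.metadata").read_bytes()
    value = (base / f"{name}.value").read_bytes()
    return metadata, value
```

For instance, `load_example(".", "primitive_null")` would return the two byte strings for the `primitive_null` example when run from this directory.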

## Descriptions

1. `primitive_<type>` -- Examples of primitives (`basic_type` = 0), one for each of the [primitive types listed in the spec]
2. `short_string` -- Example of a short string (`basic_type` = 1)
3. `object_empty` -- Example of an object (`basic_type` = 2) with no fields
4. `object_primitive` -- Example of an object with only primitive fields
5. `object_nested` -- Example of an object with other objects in its fields
6. `array_empty` -- Example of an array (`basic_type` = 3) with no elements
7. `array_primitive` -- Example of an array with only primitive elements
8. `array_nested` -- Example of an array with objects and other arrays as elements
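
The [Variant] spec stores `basic_type` in the two least significant bits of the first byte of a value (0 = primitive, 1 = short string, 2 = object, 3 = array). A rough sketch of classifying a `.value` file on that basis follows; the name `basic_type_of` and the `BASIC_TYPES` table are illustrative, not an existing API.

```python
# Basic type IDs as defined in the Variant encoding spec.
BASIC_TYPES = {0: "primitive", 1: "short_string", 2: "object", 3: "array"}


def basic_type_of(value: bytes) -> str:
    """Name the basic type held in the low 2 bits of the first value byte.

    Hypothetical helper for illustration only.
    """
    if not value:
        raise ValueError("empty Variant value")
    return BASIC_TYPES[value[0] & 0b11]
```

This can serve as a quick sanity check that, say, the bytes of `array_empty.value` are classified as `"array"`.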

## Regenerating these files

The files in this directory were initially generated by running the [`regen.py`](regen.py)
script, which used Apache Spark to generate the files. The files have subsequently been modified
where necessary to ensure that they conform to the Parquet spec.

### Modification 1: Created metadata and value for `primitive_null`

Per <https://github.com/apache/parquet-testing/issues/81>, Spark did not generate
any metadata for `null` and left `primitive_null.metadata` empty.
The metadata for `primitive_null` should be the same 3 bytes as for the other primitive types:

* header = `0x01`
* dictionary_size = `0x00`
* offsets = `dictionary_size + 1 = 1` one-byte offset: `0x00`

```shell
cp primitive_int8.metadata primitive_null.metadata
```
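
To see why those three bytes form a valid empty dictionary, the sketch below decodes a metadata buffer following the layout in the [Variant] spec: the header holds the version in bits 0-3, `sorted_strings` in bit 4, and `offset_size_minus_one` in bits 6-7, followed by little-endian `dictionary_size` and offsets. `parse_metadata` is a hypothetical helper written for this README, not code from Spark or Iceberg.

```python
def parse_metadata(metadata: bytes) -> dict:
    """Decode a Variant metadata buffer into its fields (illustrative sketch)."""
    header = metadata[0]
    version = header & 0x0F                    # bits 0-3
    sorted_strings = bool((header >> 4) & 1)   # bit 4
    offset_size = ((header >> 6) & 0b11) + 1   # bits 6-7 hold size minus one

    pos = 1
    dictionary_size = int.from_bytes(metadata[pos:pos + offset_size], "little")
    pos += offset_size

    # There are dictionary_size + 1 offsets, each offset_size bytes wide.
    offsets = []
    for _ in range(dictionary_size + 1):
        offsets.append(int.from_bytes(metadata[pos:pos + offset_size], "little"))
        pos += offset_size

    # The dictionary strings follow the offsets; consecutive offsets bound each one.
    strings = [
        metadata[pos + start:pos + end].decode("utf-8")
        for start, end in zip(offsets, offsets[1:])
    ]
    return {
        "version": version,
        "sorted_strings": sorted_strings,
        "offset_size": offset_size,
        "dictionary_size": dictionary_size,
        "offsets": offsets,
        "strings": strings,
    }
```

Applied to `b"\x01\x00\x00"`, this yields version 1, a one-byte offset size, a dictionary size of 0, and the single offset `0`.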

The value for a primitive null should be a `value_header` with no `value_data`,
resulting in a single `0x00` byte:

```shell
echo -n 'a' | tr a '\0' > primitive_null.value
```
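
The zero byte follows from how the spec packs a primitive's first byte: the primitive type ID sits in the upper 6 bits and `basic_type` (0 for primitives) in the lower 2, and null has type ID 0. A small sketch of that arithmetic, with `primitive_header` as an illustrative name only:

```python
def primitive_header(type_id: int) -> bytes:
    """Build the one-byte header for a primitive value (hypothetical helper).

    The 6-bit primitive type ID is shifted above the 2-bit basic_type,
    which is 0 for primitives.
    """
    if not 0 <= type_id < 64:
        raise ValueError("primitive type ID must fit in 6 bits")
    basic_type = 0
    return bytes([(type_id << 2) | basic_type])
```

With type ID 0 for null, `primitive_header(0)` gives `b"\x00"`, matching the contents of `primitive_null.value`.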

### Modification 2: Created `TimeNTZ/Timestamp with timezone nanos/Timestamp without timezone nanos/UUID` with Iceberg test code

Currently, Spark [does not support](https://github.com/apache/spark/blob/master/common/variant/README.md) Variant values containing UUID, Time, or nanosecond-precision Timestamp.
The `primitive_time.[metadata/value]`, `primitive_timestamp_nanos.[metadata/value]`, `primitive_timestampntz_nanos.[metadata/value]`, and `primitive_uuid.[metadata/value]` files were generated by [Iceberg test code](https://github.com/apache/iceberg/blob/3a4215dbb714477c89681ab94f1197b6ebcbdfff/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantReaders.java#L355).

[Variant]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
[primitive types listed in the spec]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-data-for-primitive-type-basic_type0
