
= Silk Compressed Archive (SICA) Development Note

Taro L. Saito <leo@xerial.org> 
November 25th, 2009

= Introduction

Silk Archiver is a compressor for the Silk data format, which contains both table and tree structured data. The compression command is:
<code>
> silk compress inputdata.silk
</code>
will generate inputdata.sica file.

== Note: File extension
What is the good file extention for the silk archive? 
* Silk Archive (.sx)
* Silk Folder  (.sf)
* Silk Compressed (.sc)

Or use different naming, such as..
* Textile (.txl) 
* Fabric (.fab)

How about the acronym?
* Silk Compressed Archive (.sica)  (I like this!)

== Goals

The goal of the silk archiver is to support:
* creating compact archives for tree/table structured data, and
* making the archives queryable.

Targetted queries include:
* Relation (table) extraction (via amoeba join)
* Interval intersection queries (via z-order index)
* Quick reference to the entries (via hash, b-tree index)

= Running Example

We obtained the following data from from UCSC's web site, and annotated the data with Silk format:
{b|refseq.silk}
<code>
% silk(version:1.0)
-seqid: hg19
 -gene(name, chr, strand, start, end, cds(start, end), num exon, exon(start), exon(end))|
NM_001033515	chr21	-	9907193	9968585	9908330	9966378	4	[9907193,9909046,9966321,9968515]	[9908432,9909277,9966380,9968585]
NM_199259	chr21	-	10906742	10990920	10906904	10973733	23	10906742,10908824,10910306,10914362,10916369,10920083,10921933,10933851,10934038,10934936,10941907,10942710,10942920,10944667,10951265,10952912,10959740,10970008,10971291,10973722,10985044,10987768,10990762,	10907040,10908895,10910399,10914442,10916475,10920164,10921995,10933940,10934120,10934997,10941972,10942774,10943020,10944787,10951427,10952963,10959800,10970062,10971345,10973776,10985102,10987877,10990920,
NM_199260	chr21	-	10906742	10990920	10906904	10973733	22	10906742,10908824,10910306,10914362,10916369,10920083,10921933,10933851,10934038,10934936,10941907,10942710,10942920,10944667,10951265,10952912,10970008,10971291,10973722,10985044,10987768,10990762,	10907040,10908895,10910399,10914442,10916475,10920164,10921995,10933940,10934120,10934997,10941972,10942774,10943020,10944787,10951427,10952963,10970062,10971345,10973776,10985102,10987877,10990920,
NM_199261	chr21	-	10906742	10990920	10906904	10973733	24	10906742,10908824,10910306,10914362,10916369,10920083,10921933,10933851,10934038,10934936,10941907,10942710,10942920,10944667,10951265,10952912,10959740,10969074,10970008,10971291,10973722,10985044,10987768,10990762,	10907040,10908895,10910399,10914442,10916475,10920164,10921995,10933940,10934120,10934997,10941972,10942774,10943020,10944787,10951427,10952963,10959800,10969128,10970062,10971345,10973776,10985102,10987877,10990920,
</code>


= Hadoop Tips 

Distributed Cache
 

Join

CompositeInputFormat 

MapFile

Simple Compression

Silk Data -> parse -> tree events -> extract relations -> sorting -> compress


Silk Data -> parse -> tree events -> extract relations -> sorting -> alternative XML structure -> compress




Silk Data -> split -> 


(map) (offset, line)  ->  (([block#], offset), SilkElement*)  -> (block#, E:SilkEnv) (reduce)

[] group key

E'1, E'2, E'3, E'4, ..., E'n

E1 = E'1
E2 = E1 + E'2
E3 = E2 + E'3
...

En = En-1 + E'n


Compute Env (E)

(map) (block#, E') -> (([1],block#), E'*) -> (block#, E) R (reduce) (collect all env)
 Compute E1 .... En in the reducer using intermediate SilkEnv (E'1, ..., E'n
)
Set the environment table E(b, E_b) to the DistributedCache


(map) (offset, line) -> (([block#], offset), SilkElement*) 
(reduce) -> (([context id], index key), relation)
(map)    -> ([(context id, index key)], relation*)    (identity map)
(reduce) -> (context id, alt_structure) (key-order preserving tree-design)




 







