File formats in Hadoop ecosystem
An ideal file format in the Hadoop ecosystem should:
- support fast reads and writes
- be splittable, so that it can be processed in parallel, which is the core principle of distributed processing
- support advanced compression using various compression codecs such as Bzip2, LZO, LZ4, Snappy, etc. Compression is a must to save storage space when dealing with large data, especially in the big data world.
- support schema evolution efficiently, i.e. allow the schema of the file to change over time (efficient schema evolution means the new schema should be able to read old files too)
In practice, no single file format provides all of these features; there is always a trade-off, e.g. a format that is fast to read is usually slow to write. The main file formats in the Hadoop ecosystem are:
- CSV/TSV (text files)
- Sequence file (key-value-based binary format)
- Avro (row-based storage format)
- Row Columnar (RC), Optimized Row Columnar (ORC), and Parquet (columnar storage formats)

Each format is summarized below.
**CSV/TSV (text files)**

| Property | Description |
| --- | --- |
| Behaviour | Each line is a record, terminated by the newline character `\n`. |
| Read/Write | Good for writes but slow for reads. |
| Compression | No block-level compression; only file-level compression using codecs such as gzip or zip. |
| Splittable | Splittable on the newline character `\n` (but not once file-level compression such as gzip is applied). |
| Schema evolution | Limited: new fields can only be appended after the existing fields, and old fields cannot be deleted. |
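Below is a minimal PySpark sketch (the paths and column names are hypothetical) illustrating the text-format trade-off from the table: plain CSV output splits on newlines, while file-level gzip compression makes each part file readable by only a single task.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-demo").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Plain text output: each part file is splittable on '\n'.
df.write.mode("overwrite").csv("/tmp/users_csv", header=True)

# File-level compression: gzip wraps the whole file, so a .csv.gz part
# cannot be split and must be decompressed by a single task.
df.write.mode("overwrite").csv("/tmp/users_csv_gz", header=True,
                               compression="gzip")

# Reading back; the schema must be inferred or supplied, since text
# files carry no embedded schema.
spark.read.csv("/tmp/users_csv", header=True, inferSchema=True).show()
```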
**Sequence file (key-value-based binary format)**

| Property | Description |
| --- | --- |
| Behaviour | Each record is stored as a key-value pair in binary format, so a sequence file takes less space than an equivalent text file. |
| Read/Write | The binary encoding makes both reads and writes faster than text-formatted files. |
| Compression | Supports block-level compression. |
| Splittable | Yes. |
| Schema evolution | Limited: new fields can only be appended after the existing fields, and old fields cannot be deleted. |
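A short PySpark RDD sketch (hypothetical path) of the key-value behaviour described above; records round-trip as binary pairs with no text parsing:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("seqfile-demo").getOrCreate()
sc = spark.sparkContext

# Each record is a (key, value) pair; PySpark converts the pairs to
# Hadoop Writable types and stores them in binary.
pairs = sc.parallelize([(1, "alice"), (2, "bob"), (3, "carol")])
pairs.saveAsSequenceFile("/tmp/users_seq")

# Reading deserializes the pairs back into Python objects.
print(sc.sequenceFile("/tmp/users_seq").collect())
```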
**Avro (row-based storage format)**

| Property | Description |
| --- | --- |
| Behaviour | A row-based storage format plus a SerDe framework. Row-based means all the fields of a record are stored together, adjacent to one another. Avro defines the schema (metadata) in JSON and serializes the data itself into a compact binary format. |
| Read/Write | Good for writes, but not efficient for reading large datasets, because the row-based layout forces even column-oriented queries to scan whole rows. |
| Compression | Supports block-level compression. |
| Splittable | Yes. |
| Schema evolution | Efficient: fields can be renamed, deleted, and added, and old files can still be read using the new schema. |
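A sketch of Avro's schema evolution using the fastavro Python library (the file path, record name, and field names are hypothetical): a file written with an old schema is read back under a new schema that adds a field with a default value, so old files remain readable.

```python
from fastavro import writer, reader, parse_schema

old_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
})

# The new schema adds "email" with a default, so records written under
# the old schema can still be decoded.
new_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": ""},
    ],
})

with open("/tmp/users.avro", "wb") as fo:
    writer(fo, old_schema, [{"id": 1, "name": "alice"}])

with open("/tmp/users.avro", "rb") as fo:
    for record in reader(fo, reader_schema=new_schema):
        print(record)  # {'id': 1, 'name': 'alice', 'email': ''}
```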
**Row Columnar (RC) and Optimized Row Columnar (ORC)**

| Property | Description |
| --- | --- |
| Behaviour | Similar to the sequence file in that data is stored as binary key-value pairs, but the data is laid out in a columnar fashion. |
| Read/Write | Optimized for fast reads; writes are comparatively slow. |
| Compression | Good block-level compression with a high compression ratio, because each column holds values of the same type, which compress well together. |
| Splittable | Yes. |
| Schema evolution | No schema evolution. |
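A minimal PySpark sketch (hypothetical path) of the columnar read advantage: selecting one column reads only that column's data instead of whole rows.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.createDataFrame([(1, "alice", 30), (2, "bob", 25)],
                           ["id", "name", "age"])
df.write.mode("overwrite").orc("/tmp/users_orc")

# Column pruning: only the "age" column is read from disk.
spark.read.orc("/tmp/users_orc").select("age").show()
```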
**Parquet (columnar storage format)**

| Property | Description |
| --- | --- |
| Behaviour | Stores data in a columnar fashion, similar to RC and ORC: each column's data is laid out contiguously. |
| Read/Write | Good for read-intensive workloads such as analytical queries (max, min, avg, sum, count). Write performance is worse because each new record's values must be appended to their respective columns, whereas Avro only needs to append the new row at the end of the file. |
| Compression | Good block-level compression with a high compression ratio, because each column holds values of the same type, which compress well together. |
| Splittable | Yes, conditionally. |
| Schema evolution | Limited: new fields can be appended after the existing fields, but old fields cannot be deleted. |
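A minimal PySpark sketch (hypothetical path) of the read-intensive pattern the table describes: an aggregate over a single column touches only that column's data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame([(1, "alice", 30), (2, "bob", 25)],
                           ["id", "name", "age"])
df.write.mode("overwrite").parquet("/tmp/users_parquet")

# Analytical query: min/max/avg read only the "age" column's chunks.
spark.read.parquet("/tmp/users_parquet") \
    .agg(F.min("age"), F.max("age"), F.avg("age")).show()
```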