
File formats in the Hadoop ecosystem


What do you expect from a file format?

It should:

  • support fast read and write
  • be splittable, so that it can be processed in parallel, which is the main principle of distributed processing
  • support advanced compression using various compression codecs such as Bzip2, LZO, LZ4, Snappy, etc. (see the short write-time sketch below). Compression is a must to save storage space when dealing with large data, especially in the big data world.
  • support schema evolution efficiently, i.e. allow changing the schema of the file (efficient schema evolution means the new schema should still be able to read old files)

Generally, no single file format provides all of these features; there is always a trade-off with one of them, e.g. a format that is fast to write is often slow to read.
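
To make the compression point concrete, here is a minimal PySpark sketch, assuming Spark is available and using made-up paths; with plain text output the codec choice also decides splittability, since bzip2-compressed text stays splittable while gzip-compressed text does not.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-demo").getOrCreate()

# Toy DataFrame; the output paths below are illustrative only.
df = spark.range(1000).withColumnRenamed("id", "value")

# bzip2-compressed text remains splittable; gzip-compressed text does not.
df.write.option("compression", "bzip2").csv("/tmp/values_bzip2_csv")
df.write.option("compression", "gzip").csv("/tmp/values_gzip_csv")
```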


File formats available in Hadoop

  • CSV/TSV (text files)
  • Sequence file (key-value based binary file format)
  • Avro (row-based storage format)
  • Row Columnar (RC), Optimized Row Columnar (ORC), and Parquet (columnar storage formats)

CSV/TSV (text files)

| Property | Details |
| --- | --- |
| Behaviour | Each line is a record, terminated by the newline character \n |
| Read/Write | Fast to write but slow to read |
| Compression | No block compression; only file-level compression using gzip or zip |
| Splittable | Yes, on the newline character \n (unless compressed with a non-splittable codec such as gzip) |
| Schema evolution | Limited: new fields can only be appended after existing fields, and old fields cannot be deleted |
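
A minimal PySpark sketch of the text-file case, assuming a tab-separated input file at an illustrative path; the header and sep options are standard DataFrameReader options.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-tsv-demo").getOrCreate()

# Read a tab-separated file; each \n-terminated line becomes one row.
df = spark.read.option("header", "true").option("sep", "\t").csv("/data/events.tsv")

# Writing is cheap: rows are simply appended as plain text lines.
df.write.option("header", "true").csv("/data/events_out_csv")
```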

Sequence file

| Property | Details |
| --- | --- |
| Behaviour | Each record is stored as a key-value pair in binary format, so a sequence file takes less space than the equivalent text file |
| Read/Write | Faster read/write than text formats because the data is binary |
| Compression | Supports block-level compression |
| Splittable | Yes |
| Schema evolution | Limited: new fields can only be appended after existing fields, and old fields cannot be deleted |
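
A small PySpark sketch of writing and reading a sequence file, assuming an RDD of key-value pairs and an illustrative output path.

```python
from pyspark import SparkContext

sc = SparkContext(appName="sequence-file-demo")

# Each record is a (key, value) pair, stored in binary Writable form.
pairs = sc.parallelize([("user1", "click"), ("user2", "view"), ("user3", "click")])
pairs.saveAsSequenceFile("/tmp/events_seq")

# Reading back yields the same (key, value) tuples.
print(sc.sequenceFile("/tmp/events_seq").collect())

sc.stop()
```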

Avro (A row-based storage format)

| Property | Details |
| --- | --- |
| Behaviour | A row-based storage format plus a SerDe framework. Row-based storage means the fields of a row are stored together, adjacent to one another. Avro defines the schema (metadata) in JSON and serializes the data into a compact binary format |
| Read/Write | Good for writes, but not efficient for reading a subset of columns from a large dataset, because row-based storage requires scanning whole rows |
| Compression | Supports block-level compression |
| Splittable | Yes |
| Schema evolution | Efficient: fields can be added, renamed, and deleted, and old files can still be read using the new schema |
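
To make the JSON schema and schema-evolution rows concrete, here is a sketch using the third-party fastavro library (the schemas, field names, and path are illustrative): the new schema adds a field with a default, so files written with the old schema remain readable.

```python
from fastavro import writer, reader, parse_schema

old_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

# Write a file with the old schema.
with open("/tmp/users.avro", "wb") as out:
    writer(out, old_schema, [{"id": 1, "name": "alice"}])

# The new schema adds a field with a default value.
new_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
})

# Old files can still be read with the new schema; the default fills the gap.
with open("/tmp/users.avro", "rb") as inp:
    for record in reader(inp, reader_schema=new_schema):
        print(record)  # {'id': 1, 'name': 'alice', 'country': 'unknown'}
```

In Spark, Avro files are typically read and written with spark.read.format("avro") / df.write.format("avro"), which requires the external spark-avro package.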

RC (Row columnar)

| Property | Details |
| --- | --- |
| Behaviour | Similar to the sequence file in that it stores data as binary key-value pairs, but the data is laid out in a columnar fashion |
| Read/Write | Optimized for fast reads; writes are slower |
| Compression | Good block-level compression with a high compression ratio, because each column holds values of the same type, which compress well |
| Splittable | Yes |
| Schema evolution | No schema evolution |

ORC is an optimized version of RC that also handles nested data well. It was developed at Hortonworks, while the original RCFile came out of Facebook.
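
A minimal PySpark sketch of writing and reading ORC (path and column names are illustrative); reading only the columns a query needs is where the columnar layout pays off.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5)],
    ["id", "name", "amount"],
)

# Column values are stored together, so they compress well and can be read selectively.
df.write.orc("/tmp/payments_orc")
spark.read.orc("/tmp/payments_orc").select("amount").show()
```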

Parquet (columnar storage format)

| Property | Details |
| --- | --- |
| Behaviour | Stores data in a columnar fashion similar to RC and ORC: within each file, rows are grouped into row groups and the values of each column are stored together as column chunks |
| Read/Write | Good for read-intensive analytical queries (max, min, avg, sum, count) that touch only a few columns. Writes are slower because rows must be buffered and reorganized into column chunks before being written, whereas Avro only needs to append the new row at the end of the file |
| Compression | Good block-level compression with a high compression ratio, because each column holds values of the same type, which compress well |
| Splittable | Yes, at row-group boundaries |
| Schema evolution | Limited: new fields can be appended after existing fields, and old fields cannot be deleted |
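
A short PySpark sketch of the analytical-read pattern Parquet targets (path and columns are illustrative): the aggregation only needs to read the amount column's chunks from storage.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5), (3, "carol", 7.0)],
    ["id", "name", "amount"],
)

# Writing buffers rows and reorganizes them into column chunks inside row groups.
df.write.parquet("/tmp/payments_parquet")

# Aggregations over one column only touch that column's chunks on disk.
spark.read.parquet("/tmp/payments_parquet").agg(
    F.min("amount"), F.max("amount"), F.avg("amount")
).show()
```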