
File formats in the Hadoop ecosystem


What do you expect from a file format?

It should:

  • support fast read and write
  • be splittable, so that it can be processed in parallel, which is the main principle of distributed processing
  • support advanced compression using various compression codecs such as Bzip2, LZO, LZ4, Snappy, etc. (see the short write-time sketch below). Compression is a must to save storage space when dealing with large data, especially in the big data world.
  • support schema evolution efficiently, i.e. allow changing the schema of the file (efficient schema evolution means the new schema should still be able to read old files)

Generally, no single file format provides all of these features; there is always a trade-off with one of them, e.g. a format that is fast to write is often slow to read.
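
To make the compression point concrete, here is a minimal PySpark sketch, assuming Spark is available and using made-up paths; with plain text output the codec choice also decides splittability, since bzip2-compressed text stays splittable while gzip-compressed text does not.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-demo").getOrCreate()

# Toy DataFrame; the output paths below are illustrative only.
df = spark.range(1000).withColumnRenamed("id", "value")

# bzip2-compressed text remains splittable; gzip-compressed text does not.
df.write.option("compression", "bzip2").csv("/tmp/values_bzip2_csv")
df.write.option("compression", "gzip").csv("/tmp/values_gzip_csv")
```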


File formats available in Hadoop

  • CSV/TSV (text files)
  • Sequence file (key-value based binary file format)
  • Avro (row-based storage format)
  • Row Columnar (RC), Optimized Row Columnar (ORC), and Parquet (columnar storage formats)

CSV/TSV (text files)

| Property | Details |
| --- | --- |
| Behaviour | Each line is a record, terminated by the newline character \n |
| Read/Write | Fast to write but slow to read |
| Compression | No block compression; only file-level compression using gzip or zip |
| Splittable | Yes, on the newline character \n (unless compressed with a non-splittable codec such as gzip) |
| Schema evolution | Limited: new fields can only be appended after existing fields, and old fields cannot be deleted |
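
A minimal PySpark sketch of the text-file case, assuming a tab-separated input file at an illustrative path; the header and sep options are standard DataFrameReader options.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-tsv-demo").getOrCreate()

# Read a tab-separated file; each \n-terminated line becomes one row.
df = spark.read.option("header", "true").option("sep", "\t").csv("/data/events.tsv")

# Writing is cheap: rows are simply appended as plain text lines.
df.write.option("header", "true").csv("/data/events_out_csv")
```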

Sequence file

| Property | Details |
| --- | --- |
| Behaviour | Each record is stored as a key-value pair in binary format, so a sequence file takes less space than the equivalent text file |
| Read/Write | Faster read/write than text formats because the data is binary |
| Compression | Supports block-level compression |
| Splittable | Yes |
| Schema evolution | Limited: new fields can only be appended after existing fields, and old fields cannot be deleted |
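
A small PySpark sketch of writing and reading a sequence file, assuming an RDD of key-value pairs and an illustrative output path.

```python
from pyspark import SparkContext

sc = SparkContext(appName="sequence-file-demo")

# Each record is a (key, value) pair, stored in binary Writable form.
pairs = sc.parallelize([("user1", "click"), ("user2", "view"), ("user3", "click")])
pairs.saveAsSequenceFile("/tmp/events_seq")

# Reading back yields the same (key, value) tuples.
print(sc.sequenceFile("/tmp/events_seq").collect())

sc.stop()
```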

Avro (A row-based storage format)

| Property | Details |
| --- | --- |
| Behaviour | A row-based storage format plus a SerDe framework. Row-based storage means the fields of a row are stored together, adjacent to one another. Avro defines the schema (metadata) in JSON and serializes the data into a compact binary format |
| Read/Write | Good for writes, but not efficient for reading a subset of columns from a large dataset, because row-based storage requires scanning whole rows |
| Compression | Supports block-level compression |
| Splittable | Yes |
| Schema evolution | Efficient: fields can be added, renamed, and deleted, and old files can still be read using the new schema |
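
To make the JSON schema and schema-evolution rows concrete, here is a sketch using the third-party fastavro library (the schemas, field names, and path are illustrative): the new schema adds a field with a default, so files written with the old schema remain readable.

```python
from fastavro import writer, reader, parse_schema

old_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

# Write a file with the old schema.
with open("/tmp/users.avro", "wb") as out:
    writer(out, old_schema, [{"id": 1, "name": "alice"}])

# The new schema adds a field with a default value.
new_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
})

# Old files can still be read with the new schema; the default fills the gap.
with open("/tmp/users.avro", "rb") as inp:
    for record in reader(inp, reader_schema=new_schema):
        print(record)  # {'id': 1, 'name': 'alice', 'country': 'unknown'}
```

In Spark, Avro files are typically read and written with spark.read.format("avro") / df.write.format("avro"), which requires the external spark-avro package.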

RC (Row columnar)

| Property | Details |
| --- | --- |
| Behaviour | Similar to the sequence file in that it stores data as binary key-value pairs, but the data is laid out in a columnar fashion |
| Read/Write | Optimized for fast reads; writes are slower |
| Compression | Good block-level compression with a high compression ratio, because each column holds values of the same type, which compress well |
| Splittable | Yes |
| Schema evolution | No schema evolution |

ORC is an optimized version of RC that also handles nested data well. It was developed at Hortonworks, while the original RCFile came out of Facebook.
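
A minimal PySpark sketch of writing and reading ORC (path and column names are illustrative); reading only the columns a query needs is where the columnar layout pays off.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5)],
    ["id", "name", "amount"],
)

# Column values are stored together, so they compress well and can be read selectively.
df.write.orc("/tmp/payments_orc")
spark.read.orc("/tmp/payments_orc").select("amount").show()
```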

Parquet (columnar storage format)

| Property | Details |
| --- | --- |
| Behaviour | Stores data in a columnar fashion similar to RC and ORC: within each file, rows are grouped into row groups and the values of each column are stored together as column chunks |
| Read/Write | Good for read-intensive analytical queries (max, min, avg, sum, count) that touch only a few columns. Writes are slower because rows must be buffered and reorganized into column chunks before being written, whereas Avro only needs to append the new row at the end of the file |
| Compression | Good block-level compression with a high compression ratio, because each column holds values of the same type, which compress well |
| Splittable | Yes, at row-group boundaries |
| Schema evolution | Limited: new fields can be appended after existing fields, and old fields cannot be deleted |
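
A short PySpark sketch of the analytical-read pattern Parquet targets (path and columns are illustrative): the aggregation only needs to read the amount column's chunks from storage.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5), (3, "carol", 7.0)],
    ["id", "name", "amount"],
)

# Writing buffers rows and reorganizes them into column chunks inside row groups.
df.write.parquet("/tmp/payments_parquet")

# Aggregations over one column only touch that column's chunks on disk.
spark.read.parquet("/tmp/payments_parquet").agg(
    F.min("amount"), F.max("amount"), F.avg("amount")
).show()
```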