Package org.apache.parquet.hadoop
Class ParquetFileWriter
java.lang.Object
org.apache.parquet.hadoop.ParquetFileWriter
Internal implementation of the Parquet file writer as a block container
Note: this is a temporary Drill-Parquet class needed to write empty Parquet files. Details in PARQUET-2026 and DRILL-7907
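A minimal usage sketch for the empty-file case this class exists for (the class name, output path, schema string and the literal size/truncation values below are illustrative assumptions, not defaults of this API):

// Sketch: write an empty Parquet file (footer only, no row groups).
import java.io.IOException;
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class EmptyParquetFileExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Illustrative schema; any MessageType works for an empty file.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message example { required int32 id; }");

    ParquetFileWriter writer = new ParquetFileWriter(
        HadoopOutputFile.fromPath(new Path("/tmp/empty.parquet"), conf),
        schema,
        ParquetFileWriter.Mode.CREATE,
        128L * 1024 * 1024,  // rowGroupSize, illustrative
        8 * 1024 * 1024,     // maxPaddingSize, illustrative
        64,                  // columnIndexTruncateLength, illustrative
        Integer.MAX_VALUE,   // statisticsTruncateLength, illustrative
        true);               // pageWriteChecksumEnabled

    writer.start();                      // writes the leading magic bytes
    writer.end(Collections.emptyMap());  // writes the footer; zero row groups
  }
}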
-
Nested Class Summary
-
Field Summary
-
Constructor Summary
- ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file)
  Deprecated. Will be removed in 2.0.0.
- ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode)
  Deprecated. Will be removed in 2.0.0.
- ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize)
  Deprecated. Will be removed in 2.0.0.
- ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize)
  Deprecated. Will be removed in 2.0.0.
- ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled)
- ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled, org.apache.parquet.crypto.FileEncryptionProperties encryptionProperties)
-
Method Summary
- void appendColumnChunk(org.apache.parquet.column.ColumnDescriptor descriptor, org.apache.parquet.io.SeekableInputStream from, org.apache.parquet.hadoop.metadata.ColumnChunkMetaData chunk, org.apache.parquet.column.values.bloomfilter.BloomFilter bloomFilter, org.apache.parquet.internal.column.columnindex.ColumnIndex columnIndex, org.apache.parquet.internal.column.columnindex.OffsetIndex offsetIndex)
- void appendFile(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file)
  Deprecated. Will be removed in 2.0.0; use appendFile(InputFile) instead.
- void appendFile(org.apache.parquet.io.InputFile file)
- void appendRowGroup(org.apache.hadoop.fs.FSDataInputStream from, org.apache.parquet.hadoop.metadata.BlockMetaData rowGroup, boolean dropColumns)
  Deprecated. Will be removed in 2.0.0; use appendRowGroup(SeekableInputStream, BlockMetaData, boolean) instead.
- void appendRowGroup(org.apache.parquet.io.SeekableInputStream from, org.apache.parquet.hadoop.metadata.BlockMetaData rowGroup, boolean dropColumns)
- void appendRowGroups(org.apache.hadoop.fs.FSDataInputStream file, List<org.apache.parquet.hadoop.metadata.BlockMetaData> rowGroups, boolean dropColumns)
  Deprecated. Will be removed in 2.0.0; use appendRowGroups(SeekableInputStream, List, boolean) instead.
- void appendRowGroups(org.apache.parquet.io.SeekableInputStream file, List<org.apache.parquet.hadoop.metadata.BlockMetaData> rowGroups, boolean dropColumns)
- void end(Map<String,String> extraMetaData)
  Ends a file once all blocks have been written.
- void endBlock()
  Ends a block once all column chunks have been written.
- void endColumn()
  Ends a column (once all rep, def and data have been written).
- long getNextRowGroupSize()
- long getPos()
- static org.apache.parquet.hadoop.metadata.ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf)
  Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
- static org.apache.parquet.hadoop.metadata.ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf, org.apache.parquet.hadoop.metadata.KeyValueMetadataMergeStrategy keyValueMetadataMergeStrategy)
  Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
- void start()
  Starts the file.
- void startBlock(long recordCount)
  Starts a block.
- void startColumn(org.apache.parquet.column.ColumnDescriptor descriptor, long valueCount, org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName)
  Starts a column inside a block.
- void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding)
  Deprecated.
- void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, long rowCount, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding)
  Writes a single page.
- void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding)
  Deprecated. This method does not support writing column indexes; use writeDataPage(int, int, BytesInput, Statistics, long, Encoding, Encoding, Encoding) instead.
- void writeDataPageV2(int rowCount, int nullCount, int valueCount, org.apache.parquet.bytes.BytesInput repetitionLevels, org.apache.parquet.bytes.BytesInput definitionLevels, org.apache.parquet.column.Encoding dataEncoding, org.apache.parquet.bytes.BytesInput compressedData, int uncompressedDataSize, org.apache.parquet.column.statistics.Statistics<?> statistics)
  Writes a single v2 data page.
- void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage)
  Writes a dictionary page.
- void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage, org.apache.parquet.format.BlockCipher.Encryptor headerBlockEncryptor, byte[] AAD)
- static void writeMergedMetadataFile(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.conf.Configuration conf)
  Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
- static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<org.apache.parquet.hadoop.Footer> footers)
  Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
- static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<org.apache.parquet.hadoop.Footer> footers, org.apache.parquet.hadoop.ParquetOutputFormat.JobSummaryLevel level)
  Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
-
Field Details
-
PARQUET_METADATA_FILE
public static final String PARQUET_METADATA_FILE
-
MAGIC_STR
public static final String MAGIC_STR
-
MAGIC
public static final byte[] MAGIC
-
EF_MAGIC_STR
public static final String EF_MAGIC_STR
-
EFMAGIC
public static final byte[] EFMAGIC
-
PARQUET_COMMON_METADATA_FILE
public static final String PARQUET_COMMON_METADATA_FILE
-
CURRENT_VERSION
public static final int CURRENT_VERSION
-
out
protected final org.apache.parquet.io.PositionOutputStream out
-
-
Constructor Details
-
ParquetFileWriter
@Deprecated
public ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file) throws IOException
Deprecated. Will be removed in 2.0.0.
- Parameters:
configuration - Hadoop configuration
schema - the schema of the data
file - the file to write to
- Throws:
IOException - if the file can not be created
-
ParquetFileWriter
@Deprecated
public ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode) throws IOException
Deprecated. Will be removed in 2.0.0.
- Parameters:
configuration - Hadoop configuration
schema - the schema of the data
file - the file to write to
mode - file creation mode
- Throws:
IOException - if the file can not be created
-
ParquetFileWriter
@Deprecated
public ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize) throws IOException
Deprecated. Will be removed in 2.0.0.
- Parameters:
configuration - Hadoop configuration
schema - the schema of the data
file - the file to write to
mode - file creation mode
rowGroupSize - the row group size
maxPaddingSize - the maximum padding
- Throws:
IOException - if the file can not be created
-
ParquetFileWriter
@Deprecated
public ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize) throws IOException
Deprecated. Will be removed in 2.0.0.
- Parameters:
file - OutputFile to create or overwrite
schema - the schema of the data
mode - file creation mode
rowGroupSize - the row group size
maxPaddingSize - the maximum padding
- Throws:
IOException - if the file can not be created
-
ParquetFileWriter
public ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled) throws IOException
- Parameters:
file - OutputFile to create or overwrite
schema - the schema of the data
mode - file creation mode
rowGroupSize - the row group size
maxPaddingSize - the maximum padding
columnIndexTruncateLength - the length to which min/max values in column indexes are truncated
statisticsTruncateLength - the length to which min/max values in row group statistics are truncated
pageWriteChecksumEnabled - whether to write out page level checksums
- Throws:
IOException - if the file can not be created
-
ParquetFileWriter
public ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled, org.apache.parquet.crypto.FileEncryptionProperties encryptionProperties) throws IOException
- Throws:
IOException
-
-
Method Details
-
start
public void start() throws IOException
start the file
- Throws:
IOException - if there is an error while writing
-
startBlock
public void startBlock(long recordCount) throws IOException
start a block
- Parameters:
recordCount - the record count in this block
- Throws:
IOException - if there is an error while writing
-
startColumn
public void startColumn(org.apache.parquet.column.ColumnDescriptor descriptor, long valueCount, org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName) throws IOException
start a column inside a block
- Parameters:
descriptor - the column descriptor
valueCount - the value count in this column
compressionCodecName - a compression codec name
- Throws:
IOException - if there is an error while writing
-
writeDictionaryPage
public void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage) throws IOException
writes a dictionary page
- Parameters:
dictionaryPage - the dictionary page
- Throws:
IOException - if there is an error while writing
-
writeDictionaryPage
public void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage, org.apache.parquet.format.BlockCipher.Encryptor headerBlockEncryptor, byte[] AAD) throws IOException
- Throws:
IOException
-
writeDataPage
@Deprecated
public void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding) throws IOException
Deprecated.
writes a single page
- Parameters:
valueCount - count of values
uncompressedPageSize - the size of the data once uncompressed
bytes - the compressed data for the page without header
rlEncoding - encoding of the repetition level
dlEncoding - encoding of the definition level
valuesEncoding - encoding of values
- Throws:
IOException - if there is an error while writing
-
writeDataPage
@Deprecated
public void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding) throws IOException
Deprecated. This method does not support writing column indexes; use writeDataPage(int, int, BytesInput, Statistics, long, Encoding, Encoding, Encoding) instead.
writes a single page
- Parameters:
valueCount - count of values
uncompressedPageSize - the size of the data once uncompressed
bytes - the compressed data for the page without header
statistics - statistics for the page
rlEncoding - encoding of the repetition level
dlEncoding - encoding of the definition level
valuesEncoding - encoding of values
- Throws:
IOException - if there is an error while writing
-
writeDataPage
public void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, long rowCount, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding) throws IOException
Writes a single page
- Parameters:
valueCount - count of values
uncompressedPageSize - the size of the data once uncompressed
bytes - the compressed data for the page without header
statistics - the statistics of the page
rowCount - the number of rows in the page
rlEncoding - encoding of the repetition level
dlEncoding - encoding of the definition level
valuesEncoding - encoding of values
- Throws:
IOException - if any I/O error occurs during writing the file
-
writeDataPageV2
public void writeDataPageV2(int rowCount, int nullCount, int valueCount, org.apache.parquet.bytes.BytesInput repetitionLevels, org.apache.parquet.bytes.BytesInput definitionLevels, org.apache.parquet.column.Encoding dataEncoding, org.apache.parquet.bytes.BytesInput compressedData, int uncompressedDataSize, org.apache.parquet.column.statistics.Statistics<?> statistics) throws IOException
Writes a single v2 data page
- Parameters:
rowCount - count of rows
nullCount - count of nulls
valueCount - count of values
repetitionLevels - repetition level bytes
definitionLevels - definition level bytes
dataEncoding - encoding for data
compressedData - compressed data bytes
uncompressedDataSize - the size of uncompressed data
statistics - the statistics of the page
- Throws:
IOException - if any I/O error occurs during writing the file
-
endColumn
public void endColumn() throws IOException
end a column (once all rep, def and data have been written)
- Throws:
IOException - if there is an error while writing
-
endBlock
public void endBlock() throws IOException
ends a block once all column chunks have been written
- Throws:
IOException - if there is an error while writing
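Taken together, the block-level methods above are called in a fixed order: startBlock, then for each column startColumn / write*Page / endColumn, then endBlock. The sketch below is illustrative only: it writes one row group with a single required int32 column, PLAIN-encoded and uncompressed, hand-rolling the little-endian page payload that would normally come from the column write path. The class and method names are examples, not part of this API.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

public class RowGroupSketch {
  // `writer` must already be started and built for "message example { required int32 id; }".
  static void writeSingleRowGroup(ParquetFileWriter writer, MessageType schema, int[] values)
      throws IOException {
    ColumnDescriptor column = schema.getColumns().get(0);

    // PLAIN-encoded int32 values are 4-byte little-endian integers; a required
    // top-level column carries no repetition/definition level bytes.
    ByteBuffer buf = ByteBuffer.allocate(4 * values.length).order(ByteOrder.LITTLE_ENDIAN);
    Statistics<?> stats = Statistics.createStats(column.getPrimitiveType());
    for (int v : values) {
      buf.putInt(v);
      stats.updateStats(v);
    }

    writer.startBlock(values.length);   // one row per value in this flat schema
    writer.startColumn(column, values.length, CompressionCodecName.UNCOMPRESSED);
    writer.writeDataPage(
        values.length,                  // valueCount
        buf.position(),                 // uncompressedPageSize (no codec, so same as written size)
        BytesInput.from(buf.array()),   // page body: the PLAIN values only
        stats,
        values.length,                  // rowCount
        Encoding.BIT_PACKED,            // rlEncoding (zero level bytes for a required column)
        Encoding.BIT_PACKED,            // dlEncoding
        Encoding.PLAIN);                // valuesEncoding
    writer.endColumn();
    writer.endBlock();
  }
}

In normal operation the page bytes are produced by the column write path (for example via ParquetWriter); ParquetFileWriter itself only frames pages, column chunks and row groups and tracks their offsets for the footer.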
-
appendFile
@Deprecated
public void appendFile(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file) throws IOException
Deprecated. Will be removed in 2.0.0; use appendFile(InputFile) instead.
- Parameters:
conf - a configuration
file - a file path to append the contents of to this file
- Throws:
IOException - if there is an error while reading or writing
-
appendFile
public void appendFile(org.apache.parquet.io.InputFile file) throws IOException
- Throws:
IOException
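As a usage sketch (assumed, not taken from this Javadoc), appendFile copies every row group of an existing Parquet file into the file being written; the source file's columns must match this writer's schema:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class AppendSketch {
  // `writer` must already be started and share the source file's schema.
  static void appendExisting(ParquetFileWriter writer, Path source, Configuration conf)
      throws IOException {
    // Copies the source row groups as raw bytes, without decoding pages.
    writer.appendFile(HadoopInputFile.fromPath(source, conf));
  }
}

appendRowGroup and appendRowGroups offer the same copy at row-group granularity.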
-
appendRowGroups
@Deprecated
public void appendRowGroups(org.apache.hadoop.fs.FSDataInputStream file, List<org.apache.parquet.hadoop.metadata.BlockMetaData> rowGroups, boolean dropColumns) throws IOException
Deprecated. Will be removed in 2.0.0; use appendRowGroups(SeekableInputStream, List, boolean) instead.
- Parameters:
file - a file stream to read from
rowGroups - row groups to copy
dropColumns - whether to drop columns from the file that are not in this file's schema
- Throws:
IOException - if there is an error while reading or writing
-
appendRowGroups
public void appendRowGroups(org.apache.parquet.io.SeekableInputStream file, List<org.apache.parquet.hadoop.metadata.BlockMetaData> rowGroups, boolean dropColumns) throws IOException
- Throws:
IOException
-
appendRowGroup
@Deprecated
public void appendRowGroup(org.apache.hadoop.fs.FSDataInputStream from, org.apache.parquet.hadoop.metadata.BlockMetaData rowGroup, boolean dropColumns) throws IOException
Deprecated. Will be removed in 2.0.0; use appendRowGroup(SeekableInputStream, BlockMetaData, boolean) instead.
- Parameters:
from - a file stream to read from
rowGroup - row group to copy
dropColumns - whether to drop columns from the file that are not in this file's schema
- Throws:
IOException - if there is an error while reading or writing
-
appendRowGroup
public void appendRowGroup(org.apache.parquet.io.SeekableInputStream from, org.apache.parquet.hadoop.metadata.BlockMetaData rowGroup, boolean dropColumns) throws IOException
- Throws:
IOException
-
appendColumnChunk
public void appendColumnChunk(org.apache.parquet.column.ColumnDescriptor descriptor, org.apache.parquet.io.SeekableInputStream from, org.apache.parquet.hadoop.metadata.ColumnChunkMetaData chunk, org.apache.parquet.column.values.bloomfilter.BloomFilter bloomFilter, org.apache.parquet.internal.column.columnindex.ColumnIndex columnIndex, org.apache.parquet.internal.column.columnindex.OffsetIndex offsetIndex) throws IOException
- Parameters:
descriptor - the descriptor for the target column
from - a file stream to read from
chunk - the column chunk to be copied
bloomFilter - the bloomFilter for this chunk
columnIndex - the column index for this chunk
offsetIndex - the offset index for this chunk
- Throws:
IOException
-
end
public void end(Map<String,String> extraMetaData) throws IOException
ends a file once all blocks have been written; closes the file
- Parameters:
extraMetaData - the extra meta data to write in the footer
- Throws:
IOException - if there is an error while writing
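For example (an illustrative fragment; `writer` is an open ParquetFileWriter with all blocks already ended, and the key/value pair is made up), application-specific metadata can be attached to the footer when closing:

// Close the file and record a custom key/value entry in the footer.
java.util.Map<String, String> extraMetaData =
    java.util.Collections.singletonMap("example.note", "written by ParquetFileWriter sketch");
writer.end(extraMetaData);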
-
mergeMetadataFiles
@Deprecated
public static org.apache.parquet.hadoop.metadata.ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
Given a list of metadata files, merge them into a single ParquetMetadata. Requires that the schemas be compatible, and the extraMetadata be exactly equal.
- Parameters:
files - a list of files to merge metadata from
conf - a configuration
- Returns:
- merged parquet metadata for the files
- Throws:
IOException - if there is an error while writing
-
mergeMetadataFiles
@Deprecated
public static org.apache.parquet.hadoop.metadata.ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf, org.apache.parquet.hadoop.metadata.KeyValueMetadataMergeStrategy keyValueMetadataMergeStrategy) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
Given a list of metadata files, merge them into a single ParquetMetadata. Requires that the schemas be compatible, and the extraMetadata be exactly equal.
- Parameters:
files - a list of files to merge metadata from
conf - a configuration
keyValueMetadataMergeStrategy - strategy to merge values for same key, if there are multiple
- Returns:
- merged parquet metadata for the files
- Throws:
IOException - if there is an error while writing
-
writeMergedMetadataFile
@Deprecated
public static void writeMergedMetadataFile(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.conf.Configuration conf) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
Given a list of metadata files, merge them into a single metadata file. Requires that the schemas be compatible, and the extraMetaData be exactly equal. This is useful when merging 2 directories of parquet files into a single directory, as long as both directories were written with compatible schemas and equal extraMetaData.
- Parameters:
files - a list of files to merge metadata from
outputPath - path to write merged metadata to
conf - a configuration
- Throws:
IOException - if there is an error while reading or writing
-
writeMetadataFile
@Deprecated
public static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<org.apache.parquet.hadoop.Footer> footers) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
writes a _metadata and _common_metadata file
- Parameters:
configuration - the configuration to use to get the FileSystem
outputPath - the directory to write the _metadata file to
footers - the list of footers to merge
- Throws:
IOException - if there is an error while writing
-
writeMetadataFile
@Deprecated
public static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<org.apache.parquet.hadoop.Footer> footers, org.apache.parquet.hadoop.ParquetOutputFormat.JobSummaryLevel level) throws IOException
Deprecated. Metadata files are not recommended and will be removed in 2.0.0.
writes a _common_metadata file, and optionally a _metadata file, depending on the ParquetOutputFormat.JobSummaryLevel provided
- Parameters:
configuration - the configuration to use to get the FileSystem
outputPath - the directory to write the _metadata file to
footers - the list of footers to merge
level - level of summary to write
- Throws:
IOException - if there is an error while writing
-
getPos
public long getPos() throws IOException
- Returns:
- the current position in the underlying file
- Throws:
IOException - if there is an error while getting the current stream's position
-
getNextRowGroupSize
public long getNextRowGroupSize() throws IOException
- Throws:
IOException
-