Package org.apache.drill.exec.physical.impl.scan.framework
Defines the projection, vector continuity and other operations for
a set of one or more readers. Separates the core reader protocol from
the logic of working with batches.
Schema Evolution
Drill discovers schema on the fly. The scan operator hosts multiple readers. In general, each reader may have a distinct schema, though the user typically arranges data so that scanned files share a common schema (else SQL is the wrong tool for analysis). Still, subtle changes can occur: file A is an old version without a new column c, while file B includes the column; and so on.

The scan operator works to ensure schema continuity as much as possible, smoothing out "soft" schema changes that are simply artifacts of reading a collection of files. Only "hard" changes (true changes) are passed downstream.
First, let us define three types of schema change:
- Trivial: a change in vectors underlying a column. Column a may be a required integer, but if the actual value vector changes, some operators treat this as a schema change.
- Soft: a change in column order, but not the type or existence of columns. Soft changes cause problems for operators that index columns by position, since column positions change.
- Hard: addition or deletion of a column, or change of column data type.
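To make the distinction concrete, here is a small sketch in plain Java rather than Drill's actual APIs; the SchemaChangeClassifier class and its string-based type model are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;

enum SchemaChange { NONE, SOFT, HARD }

// Hypothetical sketch, not Drill code: classify the change between two
// batch schemas, each modeled as an ordered map of column name -> type.
class SchemaChangeClassifier {

  static SchemaChange classify(LinkedHashMap<String, String> before,
                               LinkedHashMap<String, String> after) {
    // A column added, removed, or retyped is a hard change.
    if (!before.keySet().equals(after.keySet())) {
      return SchemaChange.HARD;
    }
    for (String col : before.keySet()) {
      if (!before.get(col).equals(after.get(col))) {
        return SchemaChange.HARD;
      }
    }
    // Same columns and types, but in a different order: a soft change.
    if (!new ArrayList<>(before.keySet())
        .equals(new ArrayList<>(after.keySet()))) {
      return SchemaChange.SOFT;
    }
    // Identical schemas. A trivial change (new vectors under an unchanged
    // schema) is invisible at this level; it depends on vector identity.
    return SchemaChange.NONE;
  }
}
```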
The scan operator must smooth out several distinct sources of schema change (a sketch of the common remedy follows the list):
- Trivial schema changes due to multiple readers. In general, each reader should be independent of the others, since each file is independent of the other files. Because readers are independent, each controls the schema (and vectors) used to read its file. However, distinct vectors trigger a trivial schema change.
- Schema changes due to readers that discover the data schema as the read progresses. The start of a file might have two columns, a and b. A second batch might discover column c. Then, when reading a second file, the process might repeat. In general, if we have already seen column c in the first file, there is no need to trigger a schema change upon discovering it in the second. (Though there are subtle issues, such as handling required types.)
- Schema changes due to columns that appear in one file but not another. This is a variation of the above issue, but occurs across files.
- Schema changes due to guessing wrong about the type of a missing column. A query might request columns a, b and c. The first file has only columns a and b, forcing the scan operator to "make up" a column c. Typically Drill uses a nullable INT. But a later reader might find an actual column c and discover that it is, say, a VARCHAR.
- Actual schema changes in the data: a file might contain a run of numbers, only to introduce a string later. File A might have columns a and b, while a second file adds column c. (In general, columns are not removed; they are only added, or have their type changed.)
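Several of these cases reduce to the same remedy: remember the types already seen and reuse them. The following plain-Java sketch shows the idea (the TypeMemory class is hypothetical, not Drill code; Drill's actual smoothing is spread across the projection machinery described below):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of type memory for schema smoothing.
class TypeMemory {
  private static final String DEFAULT_GUESS = "nullable INT";
  private final Map<String, String> seen = new HashMap<>();

  // Record a column actually read from some file.
  void observe(String column, String type) {
    seen.put(column, type);
  }

  // Choose a type for a selected column that the current file lacks.
  // Reusing a remembered type avoids a spurious schema change when an
  // earlier file supplied the column; the nullable-INT guess is a last
  // resort, and is exactly the guess that a later file can prove wrong.
  String typeForMissingColumn(String column) {
    return seen.getOrDefault(column, DEFAULT_GUESS);
  }
}
```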
The scan operator minimizes unnecessary schema changes at several levels (a sketch of the level-1 logic follows the list):
- Level 0: anything that the reader itself might do, such as a JDBC data source requesting only the required columns.
- Level 1: for queries with an explicit select (SELECT a, b, c ...), the result set loader filters out unwanted columns. So, if file A also includes column d, and file B adds d and f, the result set loader projects away the unneeded columns, avoiding both a schema change and unnecessary vector writes.
- Level 2: for multiple readers, or readers with an evolving schema, a buffering layer fills in missing columns using the type already seen, if possible.
- Level 3: soft changes are avoided by projecting table columns (and metadata columns) into the order requested by an explicit select.
- Level 4: the scan operator itself monitors the resulting schema, watching for changes that cancel out. For example, each reader builds its own schema; if two files have identical schemas, the resulting batch schemas are identical and no schema change need be advertised downstream.
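As an illustration of the level-1 idea, the sketch below (plain Java; ProjectionSets is a hypothetical name, not the result set loader's real mechanism) partitions a table's columns against an explicit select list:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of level-1 projection against an explicit select.
class ProjectionSets {
  final List<String> projected = new ArrayList<>();    // Real vectors.
  final List<String> unprojected = new ArrayList<>();  // Dummy writers.
  final List<String> missing = new ArrayList<>();      // Null columns.

  static ProjectionSets resolve(List<String> selectList,
                                List<String> tableColumns) {
    ProjectionSets result = new ProjectionSets();
    Set<String> wanted = new LinkedHashSet<>(selectList);
    for (String col : tableColumns) {
      (wanted.contains(col) ? result.projected : result.unprojected).add(col);
    }
    for (String col : selectList) {
      if (!tableColumns.contains(col)) {
        result.missing.add(col);  // To be filled with nulls (level 2).
      }
    }
    return result;
  }
}
```

For SELECT a, b, c against a file with columns a, b and d, this yields projected = [a, b], unprojected = [d], and missing = [c].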
These levels are implemented by the following classes:
- Level 1: LogicalTupleSet in the ResultSetLoader class.
- Level 2: ProjectionPlanner, ScanProjection and ScanProjector.
- Level 3: ScanProjector.
- Level 4: OperatorRecordBatch.SchemaTracker.
Selection List Processing
A key challenge in the scan operator is the mapping of table schemas to the schema returned by the scan operator. We recognize three distinct kinds of selection (a classification sketch follows the list):
- Wildcard: SELECT *
- Generic: SELECT columns, where "columns" is an array of column values. Supported only by readers that return text columns.
- Explicit: SELECT a, b, c, ... where "a", "b", "c" and so on are the expected names of table columns.
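A hedged sketch of the classification in plain Java (the SelectionKind enum and classifier are hypothetical; Drill's planner handles many more details, such as a wildcard mixed with metadata columns):

```java
import java.util.List;

enum SelectionKind { WILDCARD, GENERIC, EXPLICIT }

// Hypothetical sketch: classify a selection list into the three kinds.
class SelectionClassifier {
  static SelectionKind classify(List<String> selectList) {
    if (selectList.size() == 1 && selectList.get(0).equals("*")) {
      return SelectionKind.WILDCARD;       // SELECT *
    }
    if (selectList.size() == 1
        && selectList.get(0).equalsIgnoreCase("columns")) {
      return SelectionKind.GENERIC;        // SELECT columns
    }
    return SelectionKind.EXPLICIT;         // SELECT a, b, c, ...
  }
}
```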
A selection list passes through three distinct phases to produce the final schema of the batch returned downstream.
- Query selection planning: resolves column names to metadata (AKA implicit or partition) columns, to "*", to the special "columns" column, or to a list of expected table columns.
- File selection planning: determines the values of metadata columns based on file and/or directory names (a sketch follows the list).
- Table selection planning: determines the fully resolved output schema by resolving the wildcard or selection list against the actual table columns. A late-schema table may do table selection planning multiple times: once each time the schema changes.
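As a sketch of the file selection planning step, the hypothetical helper below derives metadata column values from a file path and the scan root. The column names filename, fqn, and dirN follow the conventions described in this package; the class itself is illustrative, not Drill code:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: compute metadata column values for one file.
class FileMetadataPlanner {

  static Map<String, String> resolve(Path scanRoot, Path file) {
    Map<String, String> values = new LinkedHashMap<>();
    values.put("fqn", file.toString());
    values.put("filename", file.getFileName().toString());
    // Each directory level between the scan root and the file becomes
    // a partition column: dir0, dir1, and so on.
    Path relativeDir = scanRoot.relativize(file).getParent();
    int depth = (relativeDir == null) ? 0 : relativeDir.getNameCount();
    for (int i = 0; i < depth; i++) {
      values.put("dir" + i, relativeDir.getName(i).toString());
    }
    return values;
  }

  public static void main(String[] args) {
    // Yields dir0 = "2018", dir1 = "q1" for this (made-up) layout.
    System.out.println(resolve(Paths.get("/data/sales"),
        Paths.get("/data/sales/2018/q1/orders.csv")));
  }
}
```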
Output Batch Construction
The batches produced by the scan have up to four distinct kinds of columns:
- File metadata (filename, fqn, etc.)
- Directory (partition) metadata (dir0, dir1, etc.)
- Table columns (or the special "columns" column)
- Null columns (expected table columns that turned out to not match any actual table column; to avoid errors, Drill returns these as columns filled with nulls)
The scan operator uses three distinct result set loaders to build the output batch, as sketched after the list:
- Table loader: for the table itself. This loader uses a selection layer to write only the data needed for the output batch, skipping unused columns.
- Metadata loader: creates the file and directory metadata columns.
- Null loader: populates the null columns.
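To show how the four column kinds and three loaders fit together, here is a hypothetical sketch (plain Java, not the ScanProjector's real logic) that assigns each selected column to the source that will supply it, in the output order demanded by the select list:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: plan which loader supplies each output column.
class OutputBatchPlan {
  enum Source { TABLE, FILE_METADATA, PARTITION, NULL_FILL }

  static Map<String, Source> plan(List<String> selectList,
                                  Set<String> tableColumns) {
    Map<String, Source> plan = new LinkedHashMap<>();
    for (String col : selectList) {
      if (tableColumns.contains(col)) {
        plan.put(col, Source.TABLE);          // Table loader.
      } else if (col.equals("filename") || col.equals("fqn")) {
        plan.put(col, Source.FILE_METADATA);  // Metadata loader.
      } else if (col.matches("dir\\d+")) {
        plan.put(col, Source.PARTITION);      // Metadata loader.
      } else {
        plan.put(col, Source.NULL_FILL);      // Null loader.
      }
    }
    return plan;  // Iteration order is the output column order.
  }
}
```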
Three classes handle the core reader protocol:
- RowBatchReader: an extremely simple interface for reading data. We would like many developers to create new plugins and readers. The simplified interface pushes all complexity into the scan framework, leaving the reader to just read.
- ShimBatchReader: an implementation of the above that converts from the simplified API to add additional structure to work with the result set loader. (The base interface is agnostic about how rows are read.)
- SchemaNegotiator: an interface that allows a batch reader to "negotiate" a schema with the scan framework. The scan framework knows the columns that are to be projected; the reader knows what columns it can offer. The schema negotiator works out how to combine the two, expressing the result as a result set loader. Column writers are defined for all columns that the reader wants to read, but only the materialized (projected) columns have actual vectors behind them; the non-projected columns are backed by free-wheeling "dummy" writers. (A reader sketch appears after the terminology note below.)
And, yes, sorry for the terminology. File "readers" read from files, but
use column "writers" to write to value vectors.
Class Structure
Some of the key classes here include:
- BasicScanFactory: Basic reader builder for simple non-file readers.
- ManagedReader&lt;T extends SchemaNegotiator&gt;: Extended version of a record reader which uses a size-aware batch mutator.
- ManagedScanFramework: Basic scan framework for a "managed" reader which uses the scan schema mechanisms encapsulated in the scan schema orchestrator.
- ManagedScanFramework.ReaderFactory: Creates a batch reader on demand.
- SchemaNegotiator: Negotiates the table schema with the scanner framework and provides context information for the reader.
- SchemaNegotiatorImpl: Implementation of the schema negotiation between scan operator and batch reader.
- ShimBatchReader: Represents a layer of row batch reader that works with a result set loader and schema manager to structure the data read by the actual row batch reader.