Class ScanLevelProjection
java.lang.Object
org.apache.drill.exec.physical.impl.scan.project.ScanLevelProjection
Parses and analyzes the projection list passed to the scanner. The
scanner accepts a projection list and a plugin-specific set of items
to read. The scan operator produces a series of output batches, which
(in the best case) all have the same schema. Since Drill is "schema
on read", in practice batch schema may evolve. The framework tries
to "smooth" such changes where possible. An output schema adds another
level of stability by specifying the set of columns to project (for
wildcard queries) and the types of those columns (for all queries).
The projection list is per scan, independent of any tables that the scanner might scan. The projection list is then used as input to the per-table projection planning.
Overview
In most query engines, this kind of projection analysis is done at plan time. But, since Drill is schema-on-read, we don't know the available columns, or their types, until we start scanning a table. The table may provide the schema up-front, or may discover it as the read proceeds. Hence, the job here is to make sense of the projection list based on static a-priori information, then to create a list that can be further resolved against a table schema when it appears. This gives us two steps:
- Scan-level projection: this class, which handles schema for the entire scan operator.
- Table-level projection: defined elsewhere, that merges the table and scan-level projections.
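As a rough illustration of the two steps, the sketch below separates what can be decided from the projection list alone from what must wait for a table schema. It is a simplified model using plain strings, not Drill's implementation: the class name `TwoStepProjectionSketch`, the `ScanPlan` type, the `resolve` method and the `=NULL` marker are all hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the two-step idea; these are not Drill classes.
class TwoStepProjectionSketch {

  /** Step 1 (scan level): built from the projection list alone, before any table is opened. */
  static final class ScanPlan {
    final List<String> requested;
    final boolean hasWildcard;

    ScanPlan(List<String> projectionList) {
      this.requested = new ArrayList<>(projectionList);
      this.hasWildcard = projectionList.contains("*");
    }
  }

  /** Step 2 (table level): resolve the scan plan against a table schema once it is known. */
  static List<String> resolve(ScanPlan plan, List<String> tableColumns) {
    List<String> output = new ArrayList<>();
    for (String name : plan.requested) {
      if (name.equals("*")) {
        output.addAll(tableColumns);      // wildcard: table columns, in table order
      } else if (indexOfIgnoreCase(tableColumns, name) >= 0) {
        output.add(name);                 // found: keep the name from the SELECT list
      } else {
        output.add(name + "=NULL");       // not in this table: stand-in for a null column
      }
    }
    return output;
  }

  private static int indexOfIgnoreCase(List<String> names, String target) {
    for (int i = 0; i < names.size(); i++) {
      if (names.get(i).equalsIgnoreCase(target)) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    ScanPlan plan = new ScanPlan(Arrays.asList("b", "x"));
    // The same scan plan is resolved separately for each table in the scan.
    System.out.println(resolve(plan, Arrays.asList("a", "b")));
    System.out.println(resolve(plan, Arrays.asList("x", "y")));
  }
}
```

The point of the split is that `ScanPlan` is built once per scan, while `resolve` runs once per table, so a column missing from one table can still be present in the next.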
Accepts the inputs needed to plan a projection, builds the mappings, and constructs the projection mapping object.
Builds the per-scan projection plan given a set of projected columns. Determines the output schema, which columns to project from the data source, which are metadata, and so on.
An annoying aspect of SQL is that the projection list (the list of columns to appear in the output) is specified after the SELECT keyword. In Relational theory, projection is about columns, selection is about rows...
Projection Mappings
Mappings can be based on the following primary use cases:
- SELECT *: Project all data source columns, whatever they happen to be. Create columns using names from the data source. The data source also determines the order of columns within the row.
- SELECT columns: Similar to SELECT * in that it projects all columns from the data source, in data source order. But, rather than creating individual output columns for each data source column, creates a single column which is an array of Varchars which holds the (text form) of each column as an array element.
- SELECT a, b, c, ...: Project a specific set of columns, identified by case-insensitive name. The output row uses the names from the SELECT list, but types from the data source. Columns appear in the row in the order specified by the SELECT.
- SELECT ...: SELECT nothing, occurs in SELECT COUNT(*) type queries. The provided projection list contains no (table) columns, though it may contain metadata columns.
Names in the SELECT list can reference any of the following types of output columns:
- Wildcard ("*") column: indicates the place in the projection list to insert the table columns once found in the table projection plan.
- Data source columns: columns from the underlying table. The table projection planner will determine if the column exists, or must be filled in with a null column.
- The generic data source columns array: columns, or optionally specific members of the columns array such as columns[1].
- Implicit columns: fqn, filename, filepath and suffix. These reference parts of the name of the file being scanned.
- Partition columns: dir0, dir1, ...: These reference parts of the path name of the file.
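The column kinds listed above can be told apart by name alone, which is what makes scan-level analysis possible before any table is read. The sketch below shows one way such a classification could look. It is an assumption-laden illustration, not the parser Drill uses: the `ColumnKindSketch` class, the `Kind` enum and the regular expressions are hypothetical, and matching the special names case-insensitively is an assumption of this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical classifier for projection-list names; not part of Drill.
class ColumnKindSketch {

  enum Kind { WILDCARD, COLUMNS_ARRAY, IMPLICIT, PARTITION, TABLE }

  // Implicit column names taken from the list above.
  private static final List<String> IMPLICIT_NAMES =
      Arrays.asList("fqn", "filename", "filepath", "suffix");

  // dir0, dir1, ... (assumed pattern for partition columns).
  private static final Pattern PARTITION = Pattern.compile("dir\\d+");

  // "columns" or a single member such as "columns[1]".
  private static final Pattern COLUMNS = Pattern.compile("columns(\\[\\d+\\])?");

  static Kind classify(String name) {
    // Assumption: special names are matched without regard to case.
    String key = name.toLowerCase(Locale.ROOT);
    if (key.equals("*")) {
      return Kind.WILDCARD;
    }
    if (COLUMNS.matcher(key).matches()) {
      return Kind.COLUMNS_ARRAY;
    }
    if (IMPLICIT_NAMES.contains(key)) {
      return Kind.IMPLICIT;
    }
    if (PARTITION.matcher(key).matches()) {
      return Kind.PARTITION;
    }
    // Anything else is left for the table projection planner to resolve.
    return Kind.TABLE;
  }

  public static void main(String[] args) {
    for (String name : Arrays.asList("*", "columns[1]", "filename", "dir0", "a")) {
      System.out.println(name + " -> " + classify(name));
    }
  }
}
```

Anything classified as `TABLE` here is only a candidate: whether the column exists, or must be filled in as a null column, is decided later against the table schema.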
Projection with a Schema
The client can provide an output schema that defines the types (and defaults) for the tuple produced by the scan. When a schema is provided, the above use cases are extended as follows:
- SELECT * with strict schema: All columns in the output schema are projected, and only those columns. If a reader offers additional columns, those columns are ignored. If the reader omits output columns, the default value (if any) for the column is used.
- SELECT * with a non-strict schema: the output tuple contains all columns from the output schema as explained above. In addition, if the reader provides any columns not in the output schema, those columns are appended to the end of the tuple. (That is, the output schema acts as if it were from an imaginary "0th" reader.)
- Explicit projection: only the requested columns appear, whether from the output schema, the reader, or as nulls.
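The two wildcard cases differ only in what happens to reader columns that the output schema does not mention. A minimal sketch of that rule follows, using column names only and ignoring types and default values. The `SchemaWildcardSketch` class and `expandWildcard` method are hypothetical, and comparing names case-insensitively is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of wildcard expansion with an output schema; not Drill code.
class SchemaWildcardSketch {

  static List<String> expandWildcard(
      List<String> outputSchema, List<String> readerColumns, boolean strict) {
    // Output schema columns always come first, in schema order.
    List<String> result = new ArrayList<>(outputSchema);
    if (!strict) {
      // Non-strict: append reader columns that the schema does not already cover.
      for (String col : readerColumns) {
        if (!containsIgnoreCase(result, col)) {
          result.add(col);
        }
      }
    }
    // Strict: extra reader columns are ignored.
    return result;
  }

  private static boolean containsIgnoreCase(List<String> names, String target) {
    for (String name : names) {
      if (name.equalsIgnoreCase(target)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList("a", "b");
    List<String> reader = Arrays.asList("b", "c");
    System.out.println(expandWildcard(schema, reader, true));
    System.out.println(expandWildcard(schema, reader, false));
  }
}
```

In the strict call, column `a` still appears even though the reader never supplies it; in the real scan that slot would be filled with the column's default value, if any.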
Nested Class Summary
Modifier and Type | Class | Description
static class | ScanLevelProjection.Builder |
static interface | ScanLevelProjection.ScanProjectionParser | Interface for add-on parsers, avoids the need to create a single, tightly-coupled parser for all types of columns.
static enum | ScanLevelProjection.ScanProjectionType | Identifies the kind of projection done for this scan.
Field Summary
Modifier and Type | Field | Description
protected final CustomErrorContext | errorContext | Context used with error messages.
protected boolean | includesWildcard |
protected List<ColumnProjection> | outputCols |
protected RequestedTuple | outputProjection | Projection definition for the scan as a whole.
protected List<ScanLevelProjection.ScanProjectionParser> | parsers |
protected final List<SchemaPath> | projectionList |
protected ScanLevelProjection.ScanProjectionType | projectionType |
protected ProjectionFilter | readerProjection | Projection definition passed to each reader.
protected final TupleMetadata | readerSchema |
protected boolean | sawWildcard |
Method Summary
Modifier and Type | Method | Description
void | addMetadataColumn(ColumnProjection outCol) |
void | addTableColumn(ColumnProjection outCol) |
static ScanLevelProjection | build(List<SchemaPath> projectionList, List<ScanLevelProjection.ScanProjectionParser> parsers) | Builder shortcut, primarily for tests.
static ScanLevelProjection | build(List<SchemaPath> projectionList, List<ScanLevelProjection.ScanProjectionParser> parsers, TupleMetadata outputSchema) | Builder shortcut, primarily for tests.
static ScanLevelProjection.Builder | builder() |
 | columns() | The entire set of output columns, in output order.
 | context() |
boolean | hasReaderSchema() |
boolean | isEmptyProjection() | Returns true if the projection list is empty.
boolean | projectAll() | Return whether this is a SELECT * query.
 | projectionType() |
 | readerProjection() |
 | readerSchema() |
 | requestedCols() | Return the set of columns from the SELECT list.
 | rootProjection() |
 | toString() |
Field Details
- errorContext
  Context used with error messages.
- projectionList
- readerSchema
- parsers
- includesWildcard
  protected boolean includesWildcard
- sawWildcard
  protected boolean sawWildcard
- outputCols
- outputProjection
  Projection definition for the scan as a whole. Parsed form of the input projection list.
- readerProjection
  Projection definition passed to each reader. This is the set of columns that the reader is asked to provide.
- projectionType
Method Details
- builder
- build
  public static ScanLevelProjection build(List<SchemaPath> projectionList, List<ScanLevelProjection.ScanProjectionParser> parsers)
  Builder shortcut, primarily for tests.
- build
  public static ScanLevelProjection build(List<SchemaPath> projectionList, List<ScanLevelProjection.ScanProjectionParser> parsers, TupleMetadata outputSchema)
  Builder shortcut, primarily for tests.
- addTableColumn
- addMetadataColumn
- context
- requestedCols
  Return the set of columns from the SELECT list.
  Returns: the SELECT list columns, in SELECT list order
- columns
  The entire set of output columns, in output order. Output order is that specified in the SELECT (for an explicit list of columns) or table order (for SELECT * queries).
  Returns: the set of output columns in output order
- projectionType
- projectAll
  public boolean projectAll()
  Return whether this is a SELECT * query.
  Returns: true if this is a SELECT * query
- isEmptyProjection
  public boolean isEmptyProjection()
  Returns true if the projection list is empty. This usually indicates a SELECT COUNT(*) query (though the scan operator does not have the context to know that an empty list does, in fact, imply a count-only query...)
  Returns: true if no table columns are projected, false if at least one column is projected (or the query contained the wildcard)
- rootProjection
- readerProjection
- hasReaderSchema
  public boolean hasReaderSchema()
- readerSchema
- toString