Class ScanOperatorExec
- All Implemented Interfaces:
OperatorExec
This class is the successor to ScanBatch and should be used by all new scan implementations.
The basic concept is to split the scan operator into layers:
- The OperatorRecordBatch which implements Drill's Volcano-like protocol.
- The scan operator "wrapper" (this class) which implements actions for the operator record batch specifically for scan. It iterates over readers, delegating semantic work to other classes.
- The implementation of per-reader semantics in the two EVF versions and other ad-hoc implementations.
- The result set loader and related classes which pack values into value vectors.
- Value vectors, which store the data.
The layered structure can be confusing. However, each layer is itself somewhat complex, so dividing the work into layers keeps the overall complexity under control.
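To make the division of labor concrete, the sketch below shows how the wrapper layer iterates over readers while delegating all per-reader work to a reader interface. The types and signatures are simplified stand-ins for illustration only, not Drill's actual OperatorRecordBatch, ScanOperatorExec, or RowBatchReader APIs.

```java
import java.util.Iterator;

// Simplified sketch of the layering described above; names mirror the Drill
// concepts, but the interfaces and signatures are illustrative stand-ins.
interface SimpleRowBatchReader {            // per-reader semantics
  boolean open();                           // true if the source could be opened
  boolean next();                           // true if a batch of rows was produced
  void close();
}

final class SimpleScanOperator {            // the scan "wrapper" layer
  private final Iterator<SimpleRowBatchReader> readers;
  private SimpleRowBatchReader current;

  SimpleScanOperator(Iterator<SimpleRowBatchReader> readers) {
    this.readers = readers;
  }

  /** Advance to the next batch, moving across readers as each one finishes. */
  boolean next() {
    while (true) {
      if (current == null) {
        if (!readers.hasNext()) {
          return false;                     // no readers left: EOF for the scan
        }
        current = readers.next();
        if (!current.open()) {              // reader has nothing to offer
          closeCurrent();
          continue;
        }
      }
      if (current.next()) {
        return true;                        // reader produced a batch
      }
      closeCurrent();                       // reader exhausted, try the next one
    }
  }

  private void closeCurrent() {
    current.close();
    current = null;
  }
}
```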
Scanner Framework
Acts as an adapter between the operator protocol and the row reader protocol. The scan operator itself is simply a framework for handling a set of readers; it knows nothing other than the interfaces of the components it works with, delegating all knowledge of schemas, projection, reading and the like to implementations of those interfaces. Because that work is complex, a set of frameworks exists to handle the most common use cases, but a specialized reader can create a framework or reader from scratch.
Error handling in this class is minimal: the enclosing record batch iterator is responsible for handling exceptions. Error handling relies on the fact that the iterator will call close() regardless of which exceptions are thrown.
Protocol
The scanner works directly with two other interfaces.
The ScanOperatorEvents implementation provides the set of readers to use. This class can simply maintain a list, or can create the reader on demand. More subtly, the factory also handles projection issues and manages vectors across subsequent readers. A number of factories are available for the most common cases. Extend these to implement a version specific to a data source.
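As a rough illustration of the "maintain a list or create readers on demand" idea, a minimal factory might look like the following. The class name and shape are assumptions made for this sketch; the real ScanOperatorEvents interface also covers projection and vector management.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Illustrative only: a factory that hands out readers one at a time, either
// from a pre-built list or by invoking suppliers that create them on demand.
final class ListReaderFactory<R> {
  private final Iterator<Supplier<R>> readerSources;

  ListReaderFactory(List<Supplier<R>> readerSources) {
    this.readerSources = readerSources.iterator();
  }

  /** Return the next reader to run, or null when the scan has no more readers. */
  R nextReader() {
    return readerSources.hasNext() ? readerSources.next().get() : null;
  }
}
```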
The RowBatchReader is a surprisingly minimal interface that nonetheless captures the essence of reading a result set as a set of batches. The factory implementations mentioned above implement this interface to provide commonly-used services, the most important of which is access to a ResultSetLoader to write values into value vectors.
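The flavor of such a reader, writing rows through a loader-style writer, might look like the sketch below. The writer interface and reader class here are simplified assumptions; the actual RowBatchReader and ResultSetLoader APIs are richer.

```java
// Simplified sketch in the spirit of RowBatchReader: open a source, fill one
// batch per call through a row-writer abstraction, and release resources on
// close. All types here are illustrative, not the Drill interfaces.
interface RowWriterLike {
  boolean batchHasRoom();                  // stand-in for "current batch is not full"
  void writeRow(Object... columnValues);
}

final class ArrayBatchReader {
  private final String[][] rows;           // pretend data source
  private int position;

  ArrayBatchReader(String[][] rows) {
    this.rows = rows;
  }

  /** Open the source; a real reader would open a file or connection here. */
  boolean open() {
    return true;
  }

  /** Fill one batch of rows; return true if any rows were written. */
  boolean next(RowWriterLike writer) {
    int written = 0;
    while (position < rows.length && writer.batchHasRoom()) {
      writer.writeRow((Object[]) rows[position++]);
      written++;
    }
    return written > 0;
  }

  void close() {
    // nothing to release for an in-memory source
  }
}
```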
Schema Versions
Readers may change schemas from time to time. To track such changes, this implementation tracks a batch schema version, maintained by comparing one schema with the next.
Readers can discover columns as they read data, such as with any JSON-based format. In this case, the row set mutator also provides a schema version, but a fine-grained one that changes each time a column is added.
The two schema versions serve different purposes and are not interchangeable. For example, if a scan reads two files, both will build up their own schemas, each increasing its internal, fine-grained version number as work proceeds. But, at the end of each batch, the schemas may (and, in fact, should) be identical, and it is that batch-level schema version that downstream operators care about.
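A minimal way to picture the batch-level version, under the assumption (consistent with the description above) that it advances only when one batch's schema differs from the previous batch's schema:

```java
import java.util.List;
import java.util.Objects;

// Assumed illustration of batch-level schema versioning: the version bumps
// only when the schema handed downstream actually differs from the prior one.
// Column names stand in for a full schema description.
final class BatchSchemaVersionTracker {
  private List<String> priorSchema;
  private int version;

  int updateAndGet(List<String> newSchema) {
    if (!Objects.equals(priorSchema, newSchema)) {
      priorSchema = List.copyOf(newSchema);
      version++;                            // downstream operators see a change
    }
    return version;
  }
}
```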
Empty Files and/or Empty Schemas
A corner case occurs if the input is empty, such as a CSV file that contains no data. The general rule is the following:
- If the reader is "early schema" (the schema is defined at open time), then the result will be a single empty batch with the schema defined. Example: a CSV file without headers; in this case, we know the schema is always the single `columns` array.
- If the reader is "late schema" (the schema is defined while the data is read), then no batch is returned because there is no schema. Example: a JSON file. It is not helpful to return a single batch with no columns; such a batch will simply conflict with some other non-empty-schema batch. It turns out that other DBs handle this case gracefully: a query of the form SELECT * FROM VALUES() will produce an empty result: no schema, no data.
- The hybrid case: the reader nominally provides an early schema, but that schema turns out to contain no columns. We treat this case identically to the late schema case. Example: a CSV file with headers in which the header line is empty.
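Expressed as code, the rule above amounts to a small decision at schema-build time. The helper below is a hypothetical sketch, not a Drill API:

```java
// Assumed sketch of the empty-input rule described above: an early, non-empty
// schema yields a single schema-only batch; a late schema, or an early schema
// with zero columns (the hybrid case), yields no batch at all, much like
// SELECT * FROM VALUES() in other databases.
final class EmptyInputRule {
  /** @return true if a schema-only (zero-row) batch should be sent downstream. */
  static boolean emitSchemaOnlyBatch(boolean schemaKnownAtOpen, int columnCount) {
    return schemaKnownAtOpen && columnCount > 0;
  }
}
```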
-
Field Summary
- protected final VectorContainerAccessor containerAccessor
- protected OperatorContext context
-
Constructor Summary
-
Method Summary
- batchAccessor(): Provides a generic access mechanism to the batch's output data.
- void bind(OperatorContext context): Bind this operator to the context.
- boolean buildSchema(): Retrieves the schema of the batch before the first actual batch of data.
- void cancel(): Alerts the operator that the query was cancelled.
- void close(): Close the operator by releasing all resources that the operator held.
- context()
- boolean next(): Retrieves the next batch of data.
-
Field Details
-
containerAccessor
-
context
-
-
Constructor Details
-
ScanOperatorExec
-
-
Method Details
-
bind
Description copied from interface: OperatorExec
Bind this operator to the context. The context provides access to per-operator, per-fragment and per-Drillbit services. Also provides access to the operator definition (AKA "pop config") for this operator.
- Specified by:
bind in interface OperatorExec
- Parameters:
context - operator context
-
batchAccessor
Description copied from interface: OperatorExec
Provides a generic access mechanism to the batch's output data. This method is called after a successful return from OperatorExec.buildSchema() and OperatorExec.next(). The batch itself can be held in a standard VectorContainer, or in some other structure more convenient for this operator.
- Specified by:
batchAccessor in interface OperatorExec
- Returns:
- the access for the batch's output container
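As an illustration of the accessor idea, the operator can keep its output in whatever container suits it and expose that container (plus a schema version) through a thin wrapper. The class below is a simplified assumption, not the actual VectorContainerAccessor:

```java
// Simplified sketch: output lives in some container object and is exposed to
// the caller through an accessor rather than handed over directly.
final class ContainerAccessorSketch<C> {
  private C container;                      // stands in for a VectorContainer
  private int schemaVersion;

  void setContainer(C container, int schemaVersion) {
    this.container = container;
    this.schemaVersion = schemaVersion;
  }

  C container() { return container; }
  int schemaVersion() { return schemaVersion; }
}
```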
-
context
-
buildSchema
public boolean buildSchema()
Description copied from interface: OperatorExec
Retrieves the schema of the batch before the first actual batch of data. The schema is returned via an empty batch (no rows, only schema) from OperatorExec.batchAccessor().
- Specified by:
buildSchema in interface OperatorExec
- Returns:
- true if a schema is available, false if the operator reached EOF before a schema was found
-
next
public boolean next()
Description copied from interface: OperatorExec
Retrieves the next batch of data. The data is returned via the OperatorExec.batchAccessor() method.
- Specified by:
next in interface OperatorExec
- Returns:
- true if another batch of data is available, false if EOF was reached and no more data is available
-
cancel
public void cancel()
Description copied from interface: OperatorExec
Alerts the operator that the query was cancelled. Generally optional, but allows the operator to realize that a cancellation was requested.
- Specified by:
cancel in interface OperatorExec
-
close
public void close()
Description copied from interface: OperatorExec
Close the operator by releasing all resources that the operator held. Called after OperatorExec.cancel() and after OperatorExec.buildSchema() or OperatorExec.next() returns false.
Note that there may be a significant delay between the last call to next() and the call to close() during which downstream operators do their work. A tidy operator will release resources immediately after EOF to avoid holding onto memory or other resources that could be used by downstream operators.
- Specified by:
- Specified by:
close in interface OperatorExec
-
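Putting the protocol together, a caller drives an operator through roughly the sequence below: bind, buildSchema, repeated next calls, then close, with close guaranteed even on failure. This is a hedged sketch of the call order documented above, using a simplified stand-in interface rather than Drill's actual OperatorExec:

```java
// Assumed driver sketch showing the documented call order:
// bind -> buildSchema -> next (repeated) -> close, with close always called.
interface OperatorExecLike {
  void bind(Object context);               // context type simplified for the sketch
  boolean buildSchema();                   // true if a schema-only batch is available
  boolean next();                          // true if another data batch is available
  void cancel();
  void close();
}

final class OperatorDriver {
  static void run(OperatorExecLike op, Object context) {
    op.bind(context);
    try {
      if (op.buildSchema()) {              // false: EOF before any schema was found
        while (op.next()) {
          // fetch the output via the batch accessor and pass it downstream
        }
      }
    } finally {
      op.close();                          // always called, even when exceptions occur
    }
  }
}
```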