Class TupleParser
- All Implemented Interfaces:
ElementParser
The structure parser maintains a map of known fields. Each time a field is parsed, it looks up the field in the map. If the field is not found, the parser looks ahead to find a value token, if any, and calls this class to add a new column. This class creates a column writer based either on the type given in a provided schema, or on the type inferred from the JSON token.
As it turns out, most of the semantic action occurs at the tuple level: that is where fields are defined, types are inferred, and projection is computed.
Nulls
Much code here deals with null types, especially leading nulls, leading empty arrays, and so on. The object parser creates a parser for each value: a parser which "does the right thing" based on the data type. For example, for a Boolean, the parser recognizes true, false and null.
But what happens if the first value for a field is null? We
don't know what kind of parser to create because we don't have a schema.
Instead, we have to create a temporary placeholder parser that will consume
nulls, waiting for a real type to show itself. Once that type appears, the
null parser can replace itself with the correct form. Each vector's
"fill empties" logic will back-fill the newly created vector with nulls
for prior rows.
Two null parsers are needed: one for when we see an empty list, and one for
when we only see null. The one for {@code null} must morph into
the one for empty lists if we see:<br>
{@code {a: null} {a: [ ] }}<br>
<p>
If we get all the way through the batch, but have still not seen a type,
then we have to guess. A prototype type system can tell us, otherwise we
guess {@code VARCHAR}. ({@code VARCHAR} is the right choice for all-text
mode, and it is as good a guess as any for other cases.)
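The self-replacing placeholder described above can be sketched roughly as follows. This is a minimal illustration, not Drill's implementation: every class and method name here ({@code FieldParserSketch}, {@code NullParserSketch}, {@code PlaceholderDemo} and so on) is hypothetical, JSON values are modeled as plain Java objects, the type labels are just strings, and resolving a still-empty array to an array of {@code VARCHAR} at the end of the batch is an assumption.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; not Drill's actual classes.
interface FieldParserSketch {
  // Consumes one value of field `key`; may replace itself in `fields`.
  void parse(String key, Object value, Map<String, FieldParserSketch> fields);
  String describe();
}

// The "real" parser, created once a type has shown itself.
final class TypedParserSketch implements FieldParserSketch {
  private final String type;
  TypedParserSketch(String type) { this.type = type; }
  @Override
  public void parse(String key, Object value, Map<String, FieldParserSketch> fields) {
    // A real parser would write the value to its column here.
  }
  @Override
  public String describe() { return type; }
}

// Placeholder used while only nulls have been seen.
final class NullParserSketch implements FieldParserSketch {
  @Override
  public void parse(String key, Object value, Map<String, FieldParserSketch> fields) {
    if (value == null) {
      return; // Still no type: keep consuming nulls.
    }
    if (value instanceof List) {
      List<?> list = (List<?>) value;
      if (list.isEmpty()) {
        // {a: null} {a: []}: morph into the empty-array placeholder.
        fields.put(key, new EmptyArrayParserSketch());
      } else {
        fields.put(key, new TypedParserSketch(PlaceholderDemo.arrayTypeOf(list)));
      }
      return;
    }
    fields.put(key, new TypedParserSketch(PlaceholderDemo.scalarTypeOf(value)));
  }
  @Override
  public String describe() { return "UNRESOLVED_NULL"; }
}

// Placeholder used once an empty array has been seen.
final class EmptyArrayParserSketch implements FieldParserSketch {
  @Override
  public void parse(String key, Object value, Map<String, FieldParserSketch> fields) {
    if (value instanceof List && !((List<?>) value).isEmpty()) {
      fields.put(key, new TypedParserSketch(PlaceholderDemo.arrayTypeOf((List<?>) value)));
    }
  }
  @Override
  public String describe() { return "UNRESOLVED_EMPTY_ARRAY"; }
}

final class PlaceholderDemo {
  static String scalarTypeOf(Object v) {
    if (v instanceof Boolean) return "BOOLEAN";
    if (v instanceof Number) return "NUMBER";
    return "VARCHAR";
  }

  static String arrayTypeOf(List<?> list) {
    Object first = list.get(0);
    // [null] is still ambiguous, so the element type falls back to VARCHAR.
    return "ARRAY<" + (first == null ? "VARCHAR" : scalarTypeOf(first)) + ">";
  }

  // Feeds one batch of values for field "a" through the parsers, then
  // forces a guess if no type appeared before the end of the batch.
  static String resolve(List<?> values) {
    Map<String, FieldParserSketch> fields = new HashMap<>();
    fields.put("a", new NullParserSketch());
    for (Object v : values) {
      fields.get("a").parse("a", v, fields);
    }
    FieldParserSketch last = fields.get("a");
    if (last instanceof NullParserSketch) return "VARCHAR";
    if (last instanceof EmptyArrayParserSketch) return "ARRAY<VARCHAR>"; // assumption
    return last.describe();
  }
}
```

In this sketch, a batch of {@code null} followed by {@code true} resolves to the Boolean label, while a batch containing only nulls falls through to the {@code VARCHAR} guess.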
<h4>Projection List Hints</h4>
To help, we consult the projection list, if any, for a column. If the
projection is of the form {@code a[0]}, we know the column had better
be an array. Similarly, if the projection list has {@code b.c}, then
{@code b} had better be an object.
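As a rough illustration of these hints, the sketch below classifies a single projection-list entry for a given column. The class name {@code ProjectionHintSketch}, the {@code Hint} enum and the string-based matching are all assumptions made for this example; the actual projection handling is likely richer than simple prefix checks.

```java
// Hypothetical helper for illustration; not part of Drill's API.
final class ProjectionHintSketch {
  enum Hint { ARRAY, OBJECT, NONE }

  // Inspects one projection entry (such as "a[0]" or "b.c") and reports
  // what it implies about the structure of `column`.
  static Hint hintFor(String column, String projection) {
    if (!projection.startsWith(column)) {
      return Hint.NONE;
    }
    String rest = projection.substring(column.length());
    if (rest.startsWith("[")) {
      return Hint.ARRAY;   // a[0]: the column had better be an array
    }
    if (rest.startsWith(".")) {
      return Hint.OBJECT;  // b.c: the column had better be an object
    }
    return Hint.NONE;      // plain "a", or a different column such as "ab"
  }
}
```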
<h4>Array Handling</h4>
The code here handles arrays in two ways. JSON normally uses the
{@code LIST} type. But, that can be expensive if lists are
well-behaved. So, the code here also implements arrays using the
classic {@code REPEATED} types. The repeated type option is disabled
by default. It can be enabled, for efficiency, if Drill ever supports
a JSON schema. If an array is well-behaved, mark that column as able
to use a repeated type.
<h4>Ambiguous Types</h4>
JSON nulls are untyped. A run of nulls does not tell us what type will
eventually appear. The best solution is to provide a schema. Without a
schema, the code is forgiving: it defers selection of the column type until
the first non-null value (or forces a type at the end of the batch).
<p>
For scalars the pattern is: <code>{a: null} {a: "foo"}</code>. Type
selection happens on the value {@code "foo"}.
<p>
For arrays, the pattern is: <code>{a: []} {a: ["foo"]}</code>. Type
selection happens on the first array element. Note that type selection
must happen on the first element, even if that element is null (which,
as we just said, is ambiguous).
<p>
If we are forced to pick a type (because we hit the end of a batch, or
we see {@code [null]}), then we pick {@code VARCHAR} as we allow any
scalar to be converted to {@code VARCHAR}. This helps for a single-file
query, but not if multiple fragments each make their own (inconsistent)
decisions. Only a schema provides a consistent answer.
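The consistency problem can be illustrated with a small sketch. It assumes two fragments that each see a different batch of values for the same column; {@code TypeSelectionSketch}, {@code pickType} and the string type labels are hypothetical names chosen for this example.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration; not part of Drill's API.
final class TypeSelectionSketch {
  // Picks a column type for one batch. A provided type always wins;
  // otherwise the first non-null value decides, and an all-null batch
  // is forced to VARCHAR at the end.
  static String pickType(List<?> batch, Optional<String> providedType) {
    if (providedType.isPresent()) {
      return providedType.get();
    }
    for (Object v : batch) {
      if (v == null) {
        continue; // Nulls are untyped: keep deferring.
      }
      if (v instanceof Boolean) return "BOOLEAN";
      if (v instanceof Number) return "NUMBER";
      return "VARCHAR";
    }
    return "VARCHAR"; // End of batch with no type seen: forced guess.
  }
}
```

Under these assumptions, a fragment that sees only nulls picks {@code VARCHAR} while another fragment that sees {@code null} followed by a number picks the numeric label, so the two disagree. Passing the same provided type to both removes the disagreement.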
-
Field Summary
Fields inherited from class org.apache.drill.exec.store.easy.json.parser.ObjectParser
logger
-
Constructor Summary
Constructor Description
TupleParser(JsonLoaderImpl loader, TupleWriter tupleWriter, TupleMetadata providedSchema)
TupleParser(JsonStructureParser structParser, JsonLoaderImpl loader, TupleWriter tupleWriter, TupleMetadata providedSchema)
-
Method Summary
Modifier and Type, Method, Description
protected FieldFactory fieldFactory
void forceEmptyArrayResolution
void forceNullResolution
loader()
onField(String key, TokenIterator tokenizer) - The structure parser has just encountered a new field for this object.
protected TupleMetadata providedSchema
resolveArray(String key, TokenIterator tokenizer)
resolveField(String key, TokenIterator tokenizer)
writer()
Methods inherited from class org.apache.drill.exec.store.easy.json.parser.ObjectParser
fieldParser, onEnd, onStart, parse, replaceFieldParser
Methods inherited from class org.apache.drill.exec.store.easy.json.parser.AbstractElementParser
errorFactory, structParser
-
Constructor Details
-
TupleParser
public TupleParser(JsonStructureParser structParser, JsonLoaderImpl loader, TupleWriter tupleWriter, TupleMetadata providedSchema)
-
TupleParser
-
-
Method Details
-
loader
-
writer
-
providedSchema
-
fieldFactory
-
onField
Description copied from class: ObjectParser
The structure parser has just encountered a new field for this object. This method returns a parser for the field, along with an optional listener to handle events within the field. The field typically uses a value parser created by the FieldParserFactory class. However, special cases (such as Mongo extended types) can create a custom parser.
If the field is not projected, the method should return a dummy parser from FieldParserFactory.ignoredFieldParser(). The dummy parser will "free-wheel" over whatever values the field contains. (This is one way to avoid structure errors in a JSON file: just ignore them.) Otherwise, the parser will look ahead to guess the field type and will call one of the "add" methods, each of which should return a value listener for the field itself.
A normal field will respond to the structure of the JSON file as it appears. The associated value listener receives events for the field value. The value listener may be asked to create additional structure, such as arrays or nested objects.
Parse position: { ... field : ^ ? for a newly-seen field. Constructs a value parser and its listeners by looking ahead some number of tokens to "sniff" the type of the value. For example:
foo: <value> - Field value
foo: [ <value> ] - 1D array value
foo: [ [<value> ] ] - 2D array value
Etc.
There are two cases in which no type estimation is possible:
foo: null
foo: []
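The look-ahead can be sketched as counting the array-start tokens that precede the first value. This is only an illustration under simplifying assumptions: tokens are modeled as plain strings rather than the parser's TokenIterator, and {@code LookaheadSketch}, {@code arrayDepth} and {@code UNRESOLVED} are hypothetical names.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of Drill's API.
final class LookaheadSketch {
  static final int UNRESOLVED = -1;

  // Returns the array depth of the upcoming value (0 for a plain field
  // value, 1 for a 1D array, and so on), or UNRESOLVED when the tokens
  // give no basis for a type estimate.
  static int arrayDepth(List<String> lookahead) {
    int depth = 0;
    for (String token : lookahead) {
      if (token.equals("[")) {
        depth++;              // One more array level before the value.
        continue;
      }
      if (token.equals("]")) {
        return UNRESOLVED;    // foo: [] has no element to inspect.
      }
      if (token.equals("null") && depth == 0) {
        return UNRESOLVED;    // foo: null is untyped.
      }
      return depth;           // First value token found.
    }
    return UNRESOLVED;        // Ran out of tokens before seeing a value.
  }
}
```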
Specified by:
onField in class ObjectParser
Parameters:
key - name of the field
tokenizer - an instance of a token iterator
Returns:
a parser for the newly-created field
-
resolveField
-
resolveArray
-
forceNullResolution
-
forceEmptyArrayResolution
-