org.apache.drill.exec.store.easy.json.loader.TupleParser

All Implemented Interfaces: ElementParser

public class TupleParser extends ObjectParser

Accepts { name : value ... }

The structure parser maintains a map of known fields. Each time a field is parsed, the structure parser looks up the field in the map. If the field is not found, the parser looks ahead to find a value token, if any, and calls this class to add a new column. This class creates a column writer based either on the type given in a provided schema, or on the type inferred from the JSON token.
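The lookup-then-create flow described above can be sketched as follows. This is a minimal illustration, not Drill's implementation: `FieldMapSketch`, `FieldParser`, and the use of plain strings for types are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the known-fields lookup; not Drill's actual classes.
class FieldMapSketch {

  // Stand-in for a column writer plus its parser; records only the chosen type.
  static final class FieldParser {
    final String name;
    final String type;

    FieldParser(String name, String type) {
      this.name = name;
      this.type = type;
    }
  }

  private final Map<String, FieldParser> knownFields = new LinkedHashMap<>();
  private final Map<String, String> providedSchema;

  FieldMapSketch(Map<String, String> providedSchema) {
    this.providedSchema = new HashMap<>(providedSchema);
  }

  // Called once per parsed field; inferredType is what look-ahead saw in the JSON.
  FieldParser onField(String key, String inferredType) {
    FieldParser parser = knownFields.get(key);
    if (parser == null) {
      // New column: the provided schema wins; otherwise use the inferred type.
      String type = providedSchema.getOrDefault(key, inferredType);
      parser = new FieldParser(key, type);
      knownFields.put(key, parser);
    }
    return parser;
  }

  int fieldCount() {
    return knownFields.size();
  }
}
```

Later occurrences of the same field reuse the parser created on first sight, so the type decision is made only once per column.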

As it turns out, most of the semantic action occurs at the tuple level: that is where fields are defined, types inferred, and projection is computed.

Nulls

Much code here deals with null types, especially leading nulls, leading empty arrays, and so on. The object parser creates a parser for each value; a parser which "does the right thing" based on the data type. For example, for a Boolean, the parser recognizes true, false and null.

But what happens if the first value for a field is null? We don't know what kind of parser to create because we don't have a schema. Instead, we have to create a temporary placeholder parser that will consume nulls, waiting for a real type to show itself. Once that type appears, the null parser can replace itself with the correct form. Each vector's "fill empties" logic will back-fill the newly created vector with nulls for prior rows.
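A minimal sketch of the placeholder idea, assuming a simple map of field parsers that the placeholder can update. The names `NullPlaceholderSketch`, `NullParser`, and `TypedParser`, and the type labels, are illustrative and are not taken from Drill. The list named `vector` stands in for a value vector, and the constructor loop stands in for the "fill empties" back-fill.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a self-replacing null parser; names are illustrative only.
class NullPlaceholderSketch {

  interface ValueParser {
    void parse(Object token);
  }

  // Parser for a column whose type is known; stores values in a stand-in "vector".
  static final class TypedParser implements ValueParser {
    final String type;
    final List<Object> vector = new ArrayList<>();

    TypedParser(String type, int priorRows) {
      this.type = type;
      // Back-fill nulls for rows seen before the type was known.
      for (int i = 0; i < priorRows; i++) {
        vector.add(null);
      }
    }

    @Override
    public void parse(Object token) {
      vector.add(token);
    }
  }

  // Placeholder that consumes nulls until a typed value appears.
  static final class NullParser implements ValueParser {
    private final String key;
    private final Map<String, ValueParser> fieldParsers;
    private int nullCount;

    NullParser(String key, Map<String, ValueParser> fieldParsers) {
      this.key = key;
      this.fieldParsers = fieldParsers;
    }

    @Override
    public void parse(Object token) {
      if (token == null) {
        nullCount++;
        return;
      }
      // Illustrative type labels only; real type inference is richer than this.
      String type = (token instanceof Boolean) ? "BOOLEAN"
          : (token instanceof Number) ? "NUMBER" : "VARCHAR";
      TypedParser real = new TypedParser(type, nullCount);
      // Replace this placeholder with the real parser, then hand over the value.
      fieldParsers.put(key, real);
      real.parse(token);
    }
  }
}
```

Callers always fetch the parser from the map, so after the swap they reach the typed parser without knowing a placeholder was ever there.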

Two null parsers are needed: one for when we see an empty list, and one for when we see only null. The one for null must morph into the one for empty lists if we see:

{a: null} {a: [ ] }

If we get all the way through the batch but have still not seen a type, then we have to guess. A prototype type system can tell us; otherwise we guess VARCHAR. (VARCHAR is the right choice for all-text mode, and it is as good a guess as any for other cases.)

Projection List Hints

To help, we consult the projection list, if any, for a column. If the projection is of the form a[0], we know the column had better be an array. Similarly, if the projection list has b.c, then b had better be an object.

Array Handling

The code here handles arrays in two ways. JSON normally uses the LIST type, but that can be expensive if lists are well-behaved. So the code here also implements arrays using the classic REPEATED types. The repeated type option is disabled by default. It can be enabled, for efficiency, if Drill ever supports a JSON schema. If an array is well-behaved, mark that column as able to use a repeated type.

Ambiguous Types

JSON nulls are untyped. A run of nulls does not tell us what type will eventually appear. The best solution is to provide a schema. Without a schema, the code is forgiving: it defers selection of the column type until the first non-null value (or forces a type at the end of the batch).

For scalars, the pattern is: {a: null} {a: "foo"}. Type selection happens on the value "foo".

For arrays, the pattern is: {a: []} {a: ["foo"]}. Type selection happens on the first array element. Note that type selection must happen on the first element, even if that element is null (which, as we just said, is ambiguous).

If we are forced to pick a type (because we hit the end of a batch, or we see [null]), then we pick VARCHAR, as we allow any scalar to be converted to VARCHAR. This helps for a single-file query, but not if multiple fragments each make their own (inconsistent) decisions. Only a schema provides a consistent answer.
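The projection list hints described earlier can be illustrated with a small sketch. It assumes, purely for illustration, that projection items are available as plain strings such as a[0] or b.c; Drill's actual projection representation is not shown here, and `ProjectionHintSketch`, `Hint`, and `hintFor` are hypothetical names.

```java
import java.util.List;

// Hypothetical sketch: projection items are plain strings here, which is an assumption.
class ProjectionHintSketch {

  enum Hint { ARRAY, OBJECT, NONE }

  // Returns what the projection list implies about the structure of a column.
  static Hint hintFor(String column, List<String> projection) {
    for (String item : projection) {
      if (item.startsWith(column + "[")) {
        return Hint.ARRAY;   // a[0] implies that a is an array
      }
      if (item.startsWith(column + ".")) {
        return Hint.OBJECT;  // b.c implies that b is an object
      }
    }
    return Hint.NONE;        // no structural hint; wait for data or a schema
  }
}
```

A hint of this kind narrows the choice only between array, object, and unknown; the element type of an array still has to come from the data or from a schema.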

Field Summary

Fields inherited from class org.apache.drill.exec.store.easy.json.parser.ObjectParser
logger
Constructor Summary

Constructors

- TupleParser(JsonLoaderImpl loader, TupleWriter tupleWriter, TupleMetadata providedSchema)
- TupleParser(JsonStructureParser structParser, JsonLoaderImpl loader, TupleWriter tupleWriter, TupleMetadata providedSchema)

Method Summary

- protected FieldFactory fieldFactory()
- void forceEmptyArrayResolution(String key)
- void forceNullResolution(String key)
- JsonLoaderImpl loader()
- ElementParser onField(String key, TokenIterator tokenizer): The structure parser has just encountered a new field for this object.
- protected TupleMetadata providedSchema()
- ElementParser resolveArray(String key, TokenIterator tokenizer)
- ElementParser resolveField(String key, TokenIterator tokenizer)
- TupleWriter writer()

Methods inherited from class org.apache.drill.exec.store.easy.json.parser.ObjectParser
fieldParser, onEnd, onStart, parse, replaceFieldParser

Methods inherited from class org.apache.drill.exec.store.easy.json.parser.AbstractElementParser
errorFactory, structParser

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- TupleParser
  
  public TupleParser(JsonStructureParser structParser, JsonLoaderImpl loader, TupleWriter tupleWriter, TupleMetadata providedSchema)
- TupleParser
  
  public TupleParser(JsonLoaderImpl loader, TupleWriter tupleWriter, TupleMetadata providedSchema)
Method Details
- loader
  
  public JsonLoaderImpl loader()
- writer
  
  public TupleWriter writer()
- providedSchema
  
  protected TupleMetadata providedSchema()
- fieldFactory
  
  protected FieldFactory fieldFactory()
- onField
  
  public ElementParser onField(String key, TokenIterator tokenizer)
  
  Description copied from class: ObjectParser
  The structure parser has just encountered a new field for this object. This method returns a parser for the field, along with an optional listener to handle events within the field. The field typically uses a value parser created by the FieldParserFactory class. However, special cases (such as Mongo extended types) can create a custom parser.
  If the field is not projected, the method should return a dummy parser from FieldParserFactory.ignoredFieldParser(). The dummy parser will "free-wheel" over whatever values the field contains. (This is one way to avoid structure errors in a JSON file: just ignore them.) Otherwise, the parser will look ahead to guess the field type and will call one of the "add" methods, each of which should return a value listener for the field itself.
  A normal field will respond to the structure of the JSON file as it appears. The associated value listener receives events for the field value. The value listener may be asked to create additional structure, such as arrays or nested objects.
  Parse position: { ... field : ^ ? for a newly-seen field. Constructs a value parser and its listeners by looking ahead some number of tokens to "sniff" the type of the value. For example:
  
  foo: <value> - Field value
  
  foo: [ <value> ] - 1D array value
  
  foo: [ [<value> ] ] - 2D array value
  
  Etc.
  
  There are two cases in which no type estimation is possible:
  
  foo: null
  
  foo: []
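
  The look-ahead cases listed above can be sketched as a small function that counts leading array brackets and reports the first value token. This is only an illustration: the `Token` enum, `Sniffed`, and `sniff` are hypothetical names, and the token list stands in for whatever the real TokenIterator supplies.

  ```java
  import java.util.List;

  // Hypothetical token model; Drill's real TokenIterator is not used here.
  class TypeSniffSketch {

    enum Token { START_ARRAY, END_ARRAY, VALUE_NULL, VALUE_STRING, VALUE_NUMBER, VALUE_BOOLEAN }

    // Result of look-ahead: array depth plus the first value token, if any.
    static final class Sniffed {
      final int arrayDims;
      final Token valueToken; // null when no type estimate is possible

      Sniffed(int arrayDims, Token valueToken) {
        this.arrayDims = arrayDims;
        this.valueToken = valueToken;
      }

      boolean typeKnown() {
        return valueToken != null;
      }
    }

    // lookAhead holds the tokens that follow "field :" in the input.
    static Sniffed sniff(List<Token> lookAhead) {
      int dims = 0;
      for (Token token : lookAhead) {
        switch (token) {
          case START_ARRAY:
            dims++;             // each "[" adds one array dimension
            break;
          case END_ARRAY:       // "[]": empty array, no element to inspect
          case VALUE_NULL:      // null: untyped
            return new Sniffed(dims, null);
          default:
            return new Sniffed(dims, token);
        }
      }
      return new Sniffed(dims, null);
    }
  }
  ```

  In this sketch, the two cases where no estimate is possible return a result whose typeKnown() is false.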
  Specified by:
  
  onField in class ObjectParser
  
  Parameters:
  
  key - name of the field
  
  tokenizer - an instance of a token iterator
  
  Returns:
  
  a parser for the newly-created field
- resolveField
  
  public ElementParser resolveField(String key, TokenIterator tokenizer)
- resolveArray
  
  public ElementParser resolveArray(String key, TokenIterator tokenizer)
- forceNullResolution
  
  public void forceNullResolution(String key)
- forceEmptyArrayResolution
  
  public void forceEmptyArrayResolution(String key)
