org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl

All Implemented Interfaces: JsonLoader, ErrorFactory

Direct Known Subclasses: KafkaJsonLoader

public class JsonLoaderImpl extends Object implements JsonLoader, ErrorFactory

Revised JSON loader based on the ResultSetLoader abstraction. Uses the listener-based JsonStructureParser to walk the JSON tree in a "streaming" fashion, raising events which this class turns into vector write operations. Listeners handle options such as all-text mode vs. type-specific parsing. Think of this implementation as a listener-based recursive-descent parser.

The JSON loader mechanism runs two intertwined state machines:

The actual parser, represented by the JsonStructureParser, which parses each JSON object, array or scalar according to its inferred type.
The type discovery machine, represented by this class, which is made complex because JSON may include long runs of nulls.

Schema Discovery

Fields are discovered on the fly. Types are inferred from the first JSON token for a field. Type inference is less than perfect: it cannot handle type changes such as first seeing 10, then 12.5, or first seeing "100", then 200.
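The limitation described above can be illustrated with a small, self-contained sketch. This is a toy model, not Drill's code: the class FirstTokenInference, its JsonKind enum and the accept() method are hypothetical names invented for the example. It fixes a column's type from the first value seen and reports a conflict when a later value has a different type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of first-token type inference. All names here are hypothetical. */
class FirstTokenInference {
    enum JsonKind { INTEGER, FLOAT, BOOLEAN, STRING }

    // Column name -> kind fixed by the first value seen for that column.
    private final Map<String, JsonKind> schema = new LinkedHashMap<>();

    /** Classifies an already-parsed JSON scalar by its Java type. */
    static JsonKind kindOf(Object value) {
        if (value instanceof Long || value instanceof Integer) return JsonKind.INTEGER;
        if (value instanceof Double || value instanceof Float) return JsonKind.FLOAT;
        if (value instanceof Boolean) return JsonKind.BOOLEAN;
        return JsonKind.STRING;
    }

    /** Returns true if the value matches the column's fixed kind, false on a conflict. */
    boolean accept(String column, Object value) {
        JsonKind seen = kindOf(value);
        JsonKind fixed = schema.putIfAbsent(column, seen); // null on first sighting
        return fixed == null || fixed == seen;
    }

    JsonKind kindOfColumn(String column) {
        return schema.get(column);
    }
}
```

In this model, accepting 10 and then 12.5 for the same column yields a conflict on the second call, which mirrors the "first seeing 10, then 12.5" case named in the text.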

When a field first contains null or an empty list, "null deferral" logic adds a special state that "waits" for an actual data type to present itself. This allows the parser to handle a series of nulls, empty arrays, or arrays of nulls (when using lists) at the start of the file. If no type ever appears, the loader forces the field to "text mode", hoping that the field is scalar.
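A minimal sketch of the null deferral idea follows, assuming a much simpler model than the real loader: DeferredNullColumn and its methods are hypothetical, values are kept as strings, and the fallback type is simply labeled "text". The column counts leading nulls, picks a type when the first real value arrives, and falls back to text if none ever does.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy column that defers its type decision while it sees only nulls. Hypothetical. */
class DeferredNullColumn {
    private String type;               // null while the type is still unknown
    private int pendingNulls;          // nulls seen before any real value
    private final List<String> values = new ArrayList<>();

    void onNull() {
        if (type == null) {
            pendingNulls++;            // cannot write yet: no type to write with
        } else {
            values.add(null);
        }
    }

    void onValue(String jsonType, String text) {
        if (type == null) {
            resolve(jsonType);         // first real value decides the type
        }
        values.add(text);
    }

    /** Called when input ends with the type still unknown: fall back to text. */
    void endOfInput() {
        if (type == null) {
            resolve("text");
        }
    }

    private void resolve(String resolvedType) {
        type = resolvedType;
        for (int i = 0; i < pendingNulls; i++) {
            values.add(null);          // back-fill the nulls that were deferred
        }
        pendingNulls = 0;
    }

    String type() { return type; }
    List<String> values() { return values; }
}
```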

To slightly help the null case, if the projection list shows that a column must be an array or a map, then that information is used to guess the type of a null column.

The code includes a prototype mechanism to provide type hints for columns. At present, it is used only to handle nulls that are never "resolved" by the end of a batch. It would be much better to use the hints (or a full schema) to avoid the large amount of code needed to handle nulls.

Provided Schema

The JSON loader accepts a provided schema which removes type ambiguities. For the examples above (runs of nulls, or shifting types), the provided schema specifies the vector type to create; the individual column listeners attempt to convert the JSON token type to the target vector type. The result is that, if the schema provides the correct type, the loader can ride over ambiguities in the input.
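As a rough illustration of converting tokens to a schema-declared type, the sketch below uses a hypothetical SchemaGuidedConverter with a three-value Target enum. It is an assumption-laden toy rather than the loader's column listeners; in particular, the narrowing of 12.5 to a long is just what Number.longValue() does here, not a statement about Drill's behavior.

```java
/** Toy conversion of parsed JSON scalars to a declared target type. Hypothetical. */
class SchemaGuidedConverter {
    enum Target { BIGINT, DOUBLE, VARCHAR }

    /** Converts a parsed token (Long, Double, String or null) to the target type. */
    static Object convert(Target target, Object token) {
        if (token == null) {
            return null;                               // nulls need no type decision
        }
        switch (target) {
            case BIGINT:
                if (token instanceof Number) {
                    return ((Number) token).longValue();
                }
                return Long.parseLong(token.toString());   // e.g. "100" -> 100
            case DOUBLE:
                if (token instanceof Number) {
                    return ((Number) token).doubleValue(); // e.g. 10 -> 10.0
                }
                return Double.parseDouble(token.toString());
            default:
                return token.toString();
        }
    }
}
```

With a DOUBLE target, both 10 and 12.5 land in the same column; with a BIGINT target, both "100" and 200 do. A string that is not numeric would throw NumberFormatException in this sketch.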

Comparison to Original JSON Reader

This class replaces the JsonReader class used in Drill versions 1.17 and before. Compared with the previous version, this implementation:

Materializes parse states as classes rather than as methods and boolean flags as in the prior version.
Reports errors as UserException objects, complete with context information, rather than as generic Java exceptions as in the prior version.
Moves parse options into a separate JsonLoaderOptions class.
Simplifies the iteration protocol: call readBatch() until it returns false. Errors are reported out-of-band via an exception.
The result set loader abstraction is perfectly happy with an empty schema. For this reason, this version (unlike the original) does not make up a dummy column if the schema would otherwise be empty.
Projection pushdown is handled by the ResultSetLoader rather than the JSON loader. This class always creates a vector writer, but the result set loader will return a dummy (no-op) writer for non-projected columns.
Like the original version, this version "free-wheels" over unprojected objects and arrays, watching only for matching brackets and ignoring all else.
Writes boolean values as SmallInt values, rather than as bits in the prior version.
This version also "free-wheels" over all unprojected values. If the user finds that they have inconsistent data in some field f, then the user can project fields except f; Drill will ignore the inconsistent values in f.
Because of this free-wheeling capability, this version does not need a "counting" reader; this same reader handles the case in which no fields are projected for SELECT COUNT(*) queries.
Runs of null values result in a "deferred null state" that patiently waits for an actual value token to appear, and only then "realizes" a parse state for that type.
Provides the same limited error recovery as the original version. See DRILL-4653 and DRILL-5953.
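The "free-wheeling" behavior in the list above can be sketched as a skip routine that tracks only bracket depth. The FreeWheel class and skipValue() method are hypothetical, and this toy works on raw characters for the sake of being self-contained; a token-based parser would count start and end tokens instead.

```java
/** Toy bracket-matching skipper for an unprojected object or array. Hypothetical. */
class FreeWheel {
    /**
     * Skips one JSON object or array.
     * Precondition: json.charAt(start) is '{' or '['.
     * Returns the index just past the matching closing bracket.
     */
    static int skipValue(String json, int start) {
        int depth = 0;
        boolean inString = false;
        for (int i = start; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;                  // skip the escaped character
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;          // brackets inside strings do not count
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                if (--depth == 0) {
                    return i + 1;         // matched the opening bracket
                }
            }
        }
        throw new IllegalArgumentException("Unbalanced brackets");
    }
}
```

Nothing inside the skipped span is interpreted, so inconsistent values in an unprojected field never reach type inference.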

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

JsonLoaderImpl.JsonLoaderBuilder
Field Summary

Fields inherited from interface org.apache.drill.exec.store.easy.json.loader.JsonLoader
JSON_LITERAL_MODE, JSON_MODE, JSON_TEXT_MODE, JSON_TYPED_MODE
Constructor Summary

Constructors

Modifier

Constructor

Description

protected

JsonLoaderImpl(JsonLoaderImpl.JsonLoaderBuilder builder)
Method Summary

Modifier and Type

Method

Description

void

addNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker)

protected UserException

buildError(UserException.Builder builder)

UserException

buildError(ColumnMetadata schema, UserException.Builder builder)

void

close()

Releases resources held by this class including the input stream.

UserException

dataConversionError(ColumnMetadata schema, String tokenType, String value)

protected void

endBatch()

Finish reading a batch of data.

FieldFactory

fieldFactory()

RuntimeException

ioException(IOException e)

I/O error reported from the Jackson JSON parser.

Map<String,Object>

listenerColumnMap()

RuntimeException

messageParseError(MessageParser.MessageContextException e)

Parser is configured to find a message tag within the JSON and a syntax error occurred when following the data path.

UserException

nullDisallowedError(ColumnMetadata schema)

JsonLoaderOptions

options()

RuntimeException

parseError(String msg, com.fasterxml.jackson.core.JsonParseException e)

The Jackson JSON parser failed to start on the input file.

JsonStructureParser

parser()

boolean

readBatch()

Read one batch of row data.

void

removeNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker)

RuntimeException

structureError(String msg)

General structure-level error: something very unusual occurred in the JSON that passed Jackson, but failed in the structure parser.

RuntimeException

syntaxError(com.fasterxml.jackson.core.JsonParseException e)

The Jackson parser reported a syntax error.

RuntimeException

syntaxError(com.fasterxml.jackson.core.JsonToken token)

Received an unexpected token.

UserException

typeConversionError(ColumnMetadata schema, String tokenType)

UserException

typeConversionError(ColumnMetadata schema, ValueDef valueDef)

RuntimeException

typeError(UnsupportedConversionError e)

The Jackson parser reported an error when trying to convert a value to a specific type.

RuntimeException

unrecoverableError()

Error recovery is on, the structure parser tried to recover, but encountered too many other errors and gave up.

UserException

unsupportedJsonTypeException(String key, ValueDef.JsonType jsonType)

UserException

unsupportedType(ColumnMetadata schema)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- JsonLoaderImpl
  
  protected JsonLoaderImpl(JsonLoaderImpl.JsonLoaderBuilder builder)
Method Details
- options
  
  public JsonLoaderOptions options()
- parser
  
  public JsonStructureParser parser()
- fieldFactory
  
  public FieldFactory fieldFactory()
- listenerColumnMap
  
  public Map<String,Object> listenerColumnMap()
- readBatch
  
  public boolean readBatch()
  
  Description copied from interface: JsonLoader
  
  Read one batch of row data.
  
  Specified by:
  
  readBatch in interface JsonLoader
  
  Returns:
  
  true if at least one record was loaded, false if EOF.
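The calling pattern implied by this contract is a simple loop. The sketch below uses a hypothetical BatchSource interface that mirrors only the readBatch() and close() signatures listed on this page, so it runs without Drill on the classpath; how a real loader is constructed and how each batch is harvested are outside its scope.

```java
/** Hypothetical stand-in exposing only the two methods used by the loop. */
interface BatchSource {
    boolean readBatch();
    void close();
}

class ReadLoop {
    /** Reads batches until the source reports EOF and returns how many were read. */
    static int drain(BatchSource source) {
        int batches = 0;
        try {
            while (source.readBatch()) {
                batches++;            // a real caller would harvest the batch here
            }
        } finally {
            source.close();           // release the input even if a read throws
        }
        return batches;
    }
}
```

Because errors surface as exceptions rather than return codes, the loop body needs no error checks; the finally block is one way to make sure close() still runs.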
- addNullMarker
  
  public void addNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker)
- removeNullMarker
  
  public void removeNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker)
- endBatch
  
  protected void endBatch()
  
  Finish reading a batch of data. We may have pending "null" columns: a column for which we've seen only nulls, or an array that has always been empty. The batch needs to finish, and needs a type, but we still don't know the type. Since we must decide on one, we guess Varchar and switch to text mode.
  This choice is not perfect. Switching to text mode means results will vary from run to run depending on the order in which we see empty and non-empty values for this column. Plus, since the system is distributed, the decision made here may conflict with that made in some other fragment.
  The only real solution is for the user to provide a schema.
  Bottom line: the user is responsible for not giving Drill ambiguous data that would require Drill to predict the future.
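  One way to picture how addNullMarker, removeNullMarker and endBatch could cooperate is a registry of still-unresolved columns. This is a sketch under assumptions: PendingNullRegistry and its Marker interface, including forceResolve(), are hypothetical, since the members of the real NullTypeMarker are not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy registry of columns that have seen only nulls so far. Hypothetical. */
class PendingNullRegistry {
    /** Hypothetical callback: pick a fallback type (for example, text) right now. */
    interface Marker {
        void forceResolve();
    }

    private final List<Marker> pending = new ArrayList<>();

    /** Registers a column whose type is not yet known. */
    void addNullMarker(Marker marker) {
        pending.add(marker);
    }

    /** Unregisters a column once a real value has revealed its type. */
    void removeNullMarker(Marker marker) {
        pending.remove(marker);
    }

    /** Forces every still-pending column to a type; returns how many were forced. */
    int endBatch() {
        // Copy first so a marker may safely touch the registry while resolving.
        List<Marker> toResolve = new ArrayList<>(pending);
        pending.clear();
        for (Marker marker : toResolve) {
            marker.forceResolve();
        }
        return toResolve.size();
    }
}
```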
- close
  
  public void close()
  
  Description copied from interface: JsonLoader
  
  Releases resources held by this class including the input stream. Does not close the result set loader passed into this instance.
  
  Specified by:
  
  close in interface JsonLoader
- parseError
  
  public RuntimeException parseError(String msg, com.fasterxml.jackson.core.JsonParseException e)
  
  Description copied from interface: ErrorFactory
  
  The Jackson JSON parser failed to start on the input file.
  
  Specified by:
  
  parseError in interface ErrorFactory
- ioException
  
  public RuntimeException ioException(IOException e)
  
  Description copied from interface: ErrorFactory
  
  I/O error reported from the Jackson JSON parser.
  
  Specified by:
  
  ioException in interface ErrorFactory
- structureError
  
  public RuntimeException structureError(String msg)
  
  Description copied from interface: ErrorFactory
  
  General structure-level error: something very unusual occurred in the JSON that passed Jackson, but failed in the structure parser.
  
  Specified by:
  
  structureError in interface ErrorFactory
- syntaxError
  
  public RuntimeException syntaxError(com.fasterxml.jackson.core.JsonParseException e)
  
  Description copied from interface: ErrorFactory
  
  The Jackson parser reported a syntax error. Will not occur if recovery is enabled.
  
  Specified by:
  
  syntaxError in interface ErrorFactory
- typeError
  
  public RuntimeException typeError(UnsupportedConversionError e)
  
  Description copied from interface: ErrorFactory
  
  The Jackson parser reported an error when trying to convert a value to a specific type. Should never occur since we only convert to the type that Jackson itself identified.
  
  Specified by:
  
  typeError in interface ErrorFactory
- syntaxError
  
  public RuntimeException syntaxError(com.fasterxml.jackson.core.JsonToken token)
  
  Description copied from interface: ErrorFactory
  
  Received an unexpected token. Should never occur as the Jackson parser itself catches errors.
  
  Specified by:
  
  syntaxError in interface ErrorFactory
- unrecoverableError
  
  public RuntimeException unrecoverableError()
  
  Description copied from interface: ErrorFactory
  
  Error recovery is on, the structure parser tried to recover, but encountered too many other errors and gave up.
  
  Specified by:
  
  unrecoverableError in interface ErrorFactory
- typeConversionError
  
  public UserException typeConversionError(ColumnMetadata schema, ValueDef valueDef)
- typeConversionError
  
  public UserException typeConversionError(ColumnMetadata schema, String tokenType)
- dataConversionError
  
  public UserException dataConversionError(ColumnMetadata schema, String tokenType, String value)
- nullDisallowedError
  
  public UserException nullDisallowedError(ColumnMetadata schema)
- unsupportedType
  
  public UserException unsupportedType(ColumnMetadata schema)
- unsupportedJsonTypeException
  
  public UserException unsupportedJsonTypeException(String key, ValueDef.JsonType jsonType)
- messageParseError
  
  public RuntimeException messageParseError(MessageParser.MessageContextException e)
  
  Description copied from interface: ErrorFactory
  
  Parser is configured to find a message tag within the JSON and a syntax error occurred when following the data path.
  
  Specified by:
  
  messageParseError in interface ErrorFactory
- buildError
  
  public UserException buildError(ColumnMetadata schema, UserException.Builder builder)
- buildError
  
  protected UserException buildError(UserException.Builder builder)
