Class JsonLoaderImpl
- All Implemented Interfaces: JsonLoader, ErrorFactory
- Direct Known Subclasses: KafkaJsonLoader
A JSON loader based on the ResultSetLoader abstraction. Uses the listener-based JsonStructureParser to walk the JSON tree in a "streaming" fashion, raising events which this class turns into vector write operations. Listeners handle options such as all-text mode vs. type-specific parsing. Think of this implementation as a listener-based recursive-descent parser.
The JSON loader mechanism runs two intertwined state machines:
- The actual parser, represented by the JsonStructureParser, which parses each JSON object, array or scalar according to its inferred type.
- The type discovery machine, represented by this class, which is made complex because JSON may include long runs of nulls.
Schema Discovery
Fields are discovered on the fly. Types are inferred from the first JSON token for a field. Type inference is less than perfect: it cannot handle type changes such as first seeing 10, then 12.5, or first seeing "100", then 200.
When a field first contains null or an empty list, "null deferral" logic adds a special state that "waits" for an actual data type to present itself. This allows the parser to handle a series of nulls, empty arrays, or arrays of nulls (when using lists) at the start of the file. If no type ever appears, the loader forces the field to "text mode", hoping that the field is scalar.
To slightly help the null case, if the projection list shows that a column must be an array or a map, then that information is used to guess the type of a null column.
The code includes a prototype mechanism to provide type hints for columns. At present, it is used only to handle nulls that are never "resolved" by the end of a batch. It would be much better to use the hints (or a full schema) to avoid the huge mass of code needed to handle nulls.
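The inference and null-deferral behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Drill's code: FieldState, ColType and inferType are hypothetical names, Java boxed values stand in for JSON tokens, and only numbers and strings are handled.

```java
/** Hypothetical sketch of first-token type inference with null deferral; not Drill's classes. */
class TypeInferenceSketch {

  /** Simplified column types; the real loader works with Drill vector types. */
  enum ColType { UNKNOWN, BIGINT, FLOAT8, VARCHAR }

  /** Tracks one field: starts UNKNOWN and commits to the type of the first non-null value. */
  static class FieldState {
    ColType type = ColType.UNKNOWN;
    int deferredNulls = 0;  // nulls seen before any typed value appeared

    /** Returns true if the value fits the committed type, false on a type conflict. */
    boolean accept(Object value) {
      if (value == null) {
        if (type == ColType.UNKNOWN) {
          deferredNulls++;  // keep waiting for a real type
        }
        return true;  // a null is acceptable for any type
      }
      ColType seen = inferType(value);
      if (type == ColType.UNKNOWN) {
        type = seen;  // the first real token decides the type
        return true;
      }
      return type == seen;  // later type changes are not handled
    }

    /** End of batch with no type seen: fall back to text (VARCHAR). */
    void endBatch() {
      if (type == ColType.UNKNOWN) {
        type = ColType.VARCHAR;
      }
    }
  }

  /** Maps a boxed Java value (a stand-in for a JSON token) to a column type. */
  static ColType inferType(Object value) {
    if (value instanceof Long || value instanceof Integer) return ColType.BIGINT;
    if (value instanceof Double || value instanceof Float) return ColType.FLOAT8;
    return ColType.VARCHAR;
  }

  public static void main(String[] args) {
    FieldState a = new FieldState();
    a.accept(null);
    a.accept(null);
    a.accept(10L);
    System.out.println(a.type + " after " + a.deferredNulls + " deferred nulls");
    System.out.println("12.5 accepted? " + a.accept(12.5));

    FieldState b = new FieldState();
    b.accept(null);
    b.endBatch();
    System.out.println("all-null field forced to " + b.type);
  }
}
```

In this sketch the field that sees 10 first commits to BIGINT and then rejects 12.5, which mirrors the limitation noted above; the all-null field only receives a type when the batch ends.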
Provided Schema
The JSON loader accepts a provided schema which removes type ambiguities. For the examples above (runs of nulls, or shifting types), the provided schema specifies the vector type to create; the individual column listeners attempt to convert the JSON token type to the target vector type. The result is that, if the schema provides the correct type, the loader can ride over ambiguities in the input.
Comparison to Original JSON Reader
This class replaces the JsonReader class used in Drill versions 1.17 and before. Compared with the previous version, this implementation:
- Materializes parse states as classes rather than as methods and boolean flags as in the prior version.
- Reports errors as UserException objects, complete with context information, rather than as generic Java exceptions as in the prior version.
- Moves parse options into a separate JsonLoaderOptions class.
- Uses a simpler iteration protocol: simply call readBatch() until it returns false. Errors are reported out-of-band via an exception.
- Accepts an empty schema, since the result set loader abstraction is perfectly happy with one. For this reason, this version (unlike the original) does not make up a dummy column if the schema would otherwise be empty.
- Leaves projection pushdown to the ResultSetLoader rather than the JSON loader. This class always creates a vector writer, but the result set loader will return a dummy (no-op) writer for non-projected columns.
- Like the original version, "free-wheels" over unprojected objects and arrays, watching only for matching brackets but ignoring all else.
- Writes boolean values as SmallInt values, rather than as bits as in the prior version.
- Also "free-wheels" over all unprojected values. If the user finds that they have inconsistent data in some field f, then the user can project all fields except f; Drill will ignore the inconsistent values in f.
- Because of this free-wheeling capability, does not need a "counting" reader; this same reader handles the case in which no fields are projected for SELECT COUNT(*) queries.
- Handles runs of null values with a "deferred null state" that patiently waits for an actual value token to appear, and only then "realizes" a parse state for that type.
- Provides the same limited error recovery as the original version. See DRILL-4653 and DRILL-5953.
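The "free-wheeling" behavior in the list above, skipping an unprojected value by watching only for matching brackets, can be illustrated with a small sketch. It assumes a pre-tokenized stream in which structural tokens are the strings "{", "}", "[" and "]"; the class and method names are hypothetical and the real loader works on Jackson tokens instead.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of "free-wheeling" over an unprojected value; token names are illustrative. */
class FreeWheelSketch {

  /**
   * Consumes exactly one JSON value from the token stream without interpreting it.
   * Only bracket nesting is tracked; scalar content and types are ignored.
   * Returns the number of tokens consumed.
   */
  static int skipValue(Iterator<String> tokens) {
    int depth = 0;
    int consumed = 0;
    while (tokens.hasNext()) {
      String token = tokens.next();
      consumed++;
      if (token.equals("{") || token.equals("[")) {
        depth++;
      } else if (token.equals("}") || token.equals("]")) {
        depth--;
      }
      if (depth == 0) {
        break;  // a scalar, or the bracket that closes the outermost structure
      }
    }
    return consumed;
  }

  public static void main(String[] args) {
    // Tokens for the value of an unprojected field: {"a": [1, "x", null], "b": 2.5},
    // followed by one token that belongs to the next field.
    List<String> tokens = Arrays.asList(
        "{", "a", "[", "1", "x", "null", "]", "b", "2.5", "}", "next");
    Iterator<String> it = tokens.iterator();
    int skipped = skipValue(it);
    System.out.println("skipped " + skipped + " tokens; next token: " + it.next());
  }
}
```

Because the skipped tokens are never type-checked, inconsistent values inside the skipped field cannot cause a conversion error, which is the property the list above relies on.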
-
Nested Class Summary
-
Field Summary
Fields inherited from interface org.apache.drill.exec.store.easy.json.loader.JsonLoader
JSON_LITERAL_MODE, JSON_MODE, JSON_TEXT_MODE, JSON_TYPED_MODE
-
Constructor Summary
-
Method Summary
| Modifier and Type | Method | Description |
| void | addNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker) | |
| protected UserException | buildError(UserException.Builder builder) | |
| | buildError(ColumnMetadata schema, UserException.Builder builder) | |
| void | close() | Releases resources held by this class including the input stream. |
| | dataConversionError(ColumnMetadata schema, String tokenType, String value) | |
| protected void | endBatch() | Finish reading a batch of data. |
| | ioException(…) | I/O error reported from the Jackson JSON parser. |
| | messageParseError(…) | Parser is configured to find a message tag within the JSON and a syntax error occurred when following the data path. |
| | nullDisallowedError(ColumnMetadata schema) | |
| | options() | |
| | parseError(String msg, com.fasterxml.jackson.core.JsonParseException e) | The Jackson JSON parser failed to start on the input file. |
| | parser() | |
| boolean | readBatch() | Read one batch of row data. |
| void | removeNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker) | |
| | structureError(String msg) | General structure-level error: something very unusual occurred in the JSON that passed Jackson, but failed in the structure parser. |
| | syntaxError(com.fasterxml.jackson.core.JsonParseException e) | The Jackson parser reported a syntax error. |
| | syntaxError(com.fasterxml.jackson.core.JsonToken token) | Received an unexpected token. |
| | typeConversionError(ColumnMetadata schema, String tokenType) | |
| | typeConversionError(ColumnMetadata schema, ValueDef valueDef) | |
| | typeError(…) | The Jackson parser reported an error when trying to convert a value to a specific type. |
| | unrecoverableError(…) | Error recovery is on, the structure parser tried to recover, but encountered too many other errors and gave up. |
| | unsupportedJsonTypeException(String key, ValueDef.JsonType jsonType) | |
| | unsupportedType(ColumnMetadata schema) | |
-
Constructor Details
-
JsonLoaderImpl
-
-
Method Details
-
options
-
parser
-
fieldFactory
-
listenerColumnMap
-
readBatch
public boolean readBatch()
Description copied from interface: JsonLoader
Read one batch of row data.
- Specified by: readBatch in interface JsonLoader
- Returns: true if at least one record was loaded, false if EOF.
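A caller might drive this method with a loop like the one below. This is only a sketch: BatchLoader and FakeLoader are hypothetical stand-ins that mimic the readBatch() and close() contract described on this page, so the example runs without Drill on the classpath.

```java
/** Hypothetical stand-in for the loader's iteration protocol; not Drill's JsonLoader interface. */
class ReadBatchLoopSketch {

  /** Minimal shape of the protocol: call readBatch() until false, then close(). */
  interface BatchLoader extends AutoCloseable {
    boolean readBatch();
    @Override void close();
  }

  /** Fake loader that yields a fixed number of batches, then reports EOF. */
  static class FakeLoader implements BatchLoader {
    private int remaining;
    boolean closed = false;

    FakeLoader(int batches) { this.remaining = batches; }

    @Override public boolean readBatch() {
      if (remaining == 0) {
        return false;  // EOF: no record was loaded
      }
      remaining--;
      return true;
    }

    @Override public void close() { closed = true; }
  }

  /** Drains the loader and returns how many batches were read. */
  static int drain(BatchLoader loader) {
    int batches = 0;
    try {
      while (loader.readBatch()) {
        batches++;  // a real caller would harvest the batch here
      }
    } finally {
      loader.close();  // runs even if readBatch() throws
    }
    return batches;
  }

  public static void main(String[] args) {
    System.out.println("batches read: " + drain(new FakeLoader(3)));
  }
}
```

Errors are not signaled through the return value; in this sketch an exception thrown by readBatch() simply propagates after the finally block closes the loader.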
-
addNullMarker
public void addNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker)
-
removeNullMarker
public void removeNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker)
-
endBatch
protected void endBatch()
Finish reading a batch of data. We may have pending "null" columns: a column for which we've seen only nulls, or an array that has always been empty. The batch needs to finish, and needs a type, but we still don't know the type. Since we must decide on one, we guess Varchar and switch to text mode.
This choice is not perfect. Switching to text mode means results will vary from run to run depending on the order in which we see empty and non-empty values for this column. Plus, since the system is distributed, the decision made here may conflict with that made in some other fragment.
The only real solution is for the user to provide a schema.
Bottom line: the user is responsible for not giving Drill ambiguous data that would require Drill to predict the future.
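One way to picture the end-of-batch step is a list of pending markers that are forced to a type when the batch ends. The sketch below borrows the method names addNullMarker, removeNullMarker and endBatch from this page, but the Marker class, its fields and every method body are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of end-of-batch resolution for unresolved null columns. */
class NullMarkerSketch {

  /** Stand-in for a marker on a column whose type is still unknown. */
  static class Marker {
    final String column;
    String resolvedType;  // null until a type is chosen

    Marker(String column) { this.column = column; }

    /** Guess used when no value ever appeared: Varchar, read in text mode. */
    void forceResolution() { resolvedType = "VARCHAR"; }
  }

  private final List<Marker> pending = new ArrayList<>();

  void addNullMarker(Marker m) { pending.add(m); }

  void removeNullMarker(Marker m) { pending.remove(m); }

  int pendingCount() { return pending.size(); }

  /** Forces a type on every column that is still unresolved, then clears the list. */
  void endBatch() {
    // Iterate over a copy so that the pending list can be modified safely.
    for (Marker m : new ArrayList<>(pending)) {
      m.forceResolution();
    }
    pending.clear();
  }

  public static void main(String[] args) {
    NullMarkerSketch loader = new NullMarkerSketch();
    Marker a = new Marker("a");
    Marker b = new Marker("b");
    loader.addNullMarker(a);
    loader.addNullMarker(b);

    // Column b sees a real value mid-batch, so its marker is withdrawn.
    b.resolvedType = "BIGINT";
    loader.removeNullMarker(b);

    loader.endBatch();
    System.out.println("a -> " + a.resolvedType + ", b -> " + b.resolvedType);
  }
}
```

The sketch also shows why results can vary between runs: whether a column ends up as the guessed Varchar or as its real type depends on whether a value happened to arrive before the batch ended.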
-
close
public void close()
Description copied from interface: JsonLoader
Releases resources held by this class including the input stream. Does not close the result set loader passed into this instance.
- Specified by: close in interface JsonLoader
-
parseError
Description copied from interface: ErrorFactory
The Jackson JSON parser failed to start on the input file.
- Specified by: parseError in interface ErrorFactory
-
ioException
Description copied from interface: ErrorFactory
I/O error reported from the Jackson JSON parser.
- Specified by: ioException in interface ErrorFactory
-
structureError
Description copied from interface: ErrorFactory
General structure-level error: something very unusual occurred in the JSON that passed Jackson, but failed in the structure parser.
- Specified by: structureError in interface ErrorFactory
-
syntaxError
Description copied from interface: ErrorFactory
The Jackson parser reported a syntax error. Will not occur if recovery is enabled.
- Specified by: syntaxError in interface ErrorFactory
-
typeError
Description copied from interface: ErrorFactory
The Jackson parser reported an error when trying to convert a value to a specific type. Should never occur since we only convert to the type that Jackson itself identified.
- Specified by: typeError in interface ErrorFactory
-
syntaxError
Description copied from interface: ErrorFactory
Received an unexpected token. Should never occur as the Jackson parser itself catches errors.
- Specified by: syntaxError in interface ErrorFactory
-
unrecoverableError
Description copied from interface: ErrorFactory
Error recovery is on, the structure parser tried to recover, but encountered too many other errors and gave up.
- Specified by: unrecoverableError in interface ErrorFactory
-
typeConversionError
-
typeConversionError
-
dataConversionError
-
nullDisallowedError
-
unsupportedType
-
unsupportedJsonTypeException
-
messageParseError
Description copied from interface: ErrorFactory
Parser is configured to find a message tag within the JSON and a syntax error occurred when following the data path.
- Specified by: messageParseError in interface ErrorFactory
-
buildError
-
buildError
-