Thursday, January 23, 2025
No menu items!
HomeDatabase ManagementOutliers, high-entropy and flex columns in JSON-to-Duality

Outliers, high-entropy and flex columns in JSON-to-Duality

Fun fact: by coffee consumption by country in 2024, Finland is #1 in the world.

If you are familiar with JSON and still wondering about the meaning of the words in the title, then that’s fine – these are relatively new concepts related to arguably the most important feature of Oracle Database 23ai: JSON Relational Duality. Migrating from MongoDB or any other document store can be now automated, i.e., creating the relational model with the Duality Views on top along with migrating the data can be done using a new migration tool. Check first “The JSON-To-Duality Migrator in Oracle Database 23ai” if you are not familiar with the new tool and how to use it.

Consider the following collection of 12 documents containing different types of coffee in order to describe with an example the new concepts (most of the data depicts the type of the coffee, its price, size and available quantity; plus a date field). Espresso and Finlandia coffee have some additional fields (Decaf has also a field describing its brand):

Use the following command in mongosh in order to insert/generate the data into a collection called COFFEE:

db.COFFEE.insertMany([
	{ "_id" : 1, "item" : "Espresso", "price" : 5, "size": "Short", "quantity" : 22, "date" : ISODate("2024-01-15T08:00:00Z"),
"CoffeeItems" : [{ "Details"     : { "Description" : "Italian Espresso",
                                "UnitPrice"   : 5,
                                "Code"        : 28995},
                   "Quantity" : 9.0 },
                 { "Details"     : { "Description" : "Finnish Espresso",
                                "UnitPrice"   : 5,
                                "Code"        : 28996},
                   "Quantity" : 11.0 } ] },
	{ "_id" : 2, "item" : "Cappuccino", "price" : 6, "size": "Short","quantity" : 12, "date" : ISODate("2024-08-16T09:00:00Z") },
	{ "_id" : 3, "item" : "Latte", "price" : 10, "size": "Grande","quantity" : 25, "date" : ISODate("2024-02-14T09:05:00Z") },
	{ "_id" : 4, "item" : "Mocha", "price" : 8,"size": "Tall", "quantity" : 11, "date" : ISODate("2024-02-17T08:00:00Z") },
	{ "_id" : 5, "item" : "Americano", "price" : 1, "size": "Grande","quantity" : 12, "date" : ISODate("2024-02-18T21:06:00Z") },
	{ "_id" : 6, "item" : "Cortado", "price" : 7, "size": "Tall","quantity" : 20, "date" : ISODate("2024-02-20T10:07:00Z") },
	{ "_id" : 7, "item" : "Macchiato", "price" : 9,"size": "Tall", "quantity" : 30, "date" : ISODate("2024-02-21T10:08:00Z") },
	{ "_id" : 8, "item" : "Turkish Coffee", "price" : 20, "size": "Grande","quantity" : 21, "date" : ISODate("2024-02-22T14:09:00Z") },
	{ "_id" : 9, "item" : "Ice Coffee", "price" : 15, "size": "Grande","quantity" : 17, "date" : ISODate("2024-08-23T14:09:00Z") },
	{ "_id" : 10, "item" : "Dirty Chai", "price" : 12, "size": "Tall","quantity" : 15, "date" : ISODate("2024-08-25T14:09:00Z") },
	{ "_id" : 11, "item" : "Decaf", "price" : 4, "size": "Normal", "quantity" : 2, "date" : ISODate("2024-01-16T11:01:00Z"), "brand":"Relax" },
	{ "_id" : 12, "item" : "Finlandia", "price" : 50, "size": "Grande","quantity" : 107, "date" : ISODate("2024-08-16T10:00:00Z"),
"CoffeeItems" : [{ "Details"     : { "Description" : "Coffee from Helsinki",
                                "UnitPrice"   : 50,
                                "Code"        : 35801},
                   "Quantity" : 50.0 },
                 { "Details"     : { "Description" : "Coffee from Tampere",
                                "UnitPrice"   : 50,
                                "Code"        : 35802},
                   "Quantity" : 40.0 } ,
                 { "Details"     : { "Description" : "Coffee from Turku",
                                "UnitPrice"   : 50,
                                "Code"        : 35803},
                   "Quantity" : 17.0 }] }
]);

Let us look into the definitions first:

A field is an outlier for a given document set if it occurs, or if any of its values occurs with a given type, in less than minFrequency percent of the documents. The parameter minFrequency is the minimum frequency for a field not to be considered an outlier (high-entropy).

The default minFrequency value is 5, meaning that a field that occurs in less than five percent of a view’s documents, or a field that occurs with a value of some type in less than five percent of a view’s documents, is considered high-entropy.

A duality-view flex column stores (in an underlying table) JSON objects whose fields aren’t predefined: they’re not mapped individually to specific underlying columns. Unrecognized fields of an object in a document you insert or update are automatically added to the flex column for that object’s underlying table (shown below with an example).

Next, let us migrate the JSON collection to Duality, here is how:

DECLARE
  schema_sql clob;
BEGIN
  schema_sql :=
   dbms_json_duality.infer_and_generate_schema(
   json('{"tableNames"    : [ "COFFEE" ],
          "useFlexFields" : true,
          "updatability"  : true,
          "minFrequency"  : 20,
          "sourceSchema"  : "ADMIN"}'));
  dbms_output.put_line('DDL Script: ');
  dbms_output.put_line(schema_sql);
  execute immediate schema_sql;
 dbms_json_duality.import(table_name => 'COFFEE', view_name => 'COFFEE_DUALITY');
END;
/

Note that it’s technically incorrect to speak of a flex column of a table. A flex column is a duality-view column that’s designated as flex — flex for the view.

Now, let us look at the objects that Oracle created after we ran the JSON-to-Duality migrator:

Here is how the data in the JSON Duality View COFFEE_DUALITY looks like from MongoDB Compass:

Now, back to the main topic, here is the ROOT table that was created automativally:

Note the ORA$FLEX column: the fields “Coffe Items” and “brand” are outliers because they exist in less than 20% of the 12 documents. Note also that we set minFrequency to 20 in the JSON-to-Duality Migrator.

If we insert another document with a new field called “color” and update one of the existing documents, we can see that “color” is also an outlier.

insert into COFFEE_DUALITY (data) values ('{ "_id" : 13, "item" : "Ristretto", "price" : 11, "size": "Short","quantity" : 0, "date" : "2024-12-16T10:00:00Z", "color":"dark" }');

update COFFEE_DUALITY cd set cd.data = json_transform(cd.data, set '$.color' = 'light') where cd.data."_id" = 3;

All generated field and column names always have the prefix ORA$. If you pay attention to the column names of the underlying COFFEE_ROOT table, you will probably notice the ORA$DATE and ORA$SIZE columns. Note that both DATE and SIZE are reserved words in Oracle (check page 262).

Any tables underlying a duality view can have several JSON-type columns. But at most one JSON column per table can be designated as a flex column at each position where that table is used in the duality view definition. If a given table is used only at one place in a view definition (a typical case) then only one flex column for the table can be used. If the same table is used in N different places in a view definition, then up to N different flex columns for the table can be designated at those places.

For more information on flex columns, check Flex Columns, Beyond the Basics. For additional information on outliers/high-entropy check the JSON-To-Duality Converter.

Read MoreJulian Dontcheff’s Database Blog

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments