Handling huge(ish) data right in the Browser

 

Due to the limitations of browsers and client devices, the idea of storing huge amounts of data and executing operations directly in the browser was long considered impractical, as it would require too much memory just for storage and leave very little for data operations and business logic. Instead, web applications would read chunks of data from an API connected to a backend with a specialized database that could handle that volume of data without impacting the performance of the web application.

 

This approach continues to be a valid option, but it lacks the snappy responses we strive for in a web application: we have to wait for the backend to build our views and diagrams, or even process huge batches of data, before we can see updated results in the UI.

 

The definition of “big” data keeps on changing. In the recent past, a terabyte seemed like an unthinkable amount of storage. Now, companies routinely handle thousands of times that amount of data. The power of personal computers has increased in the same way. Browsers routinely handle hundreds of megabytes of data, or more. This has opened up opportunities for data-intensive applications served over the web. With the right bandwidth and front-end technologies, they can provide a rich exploratory experience with data sets that would have been considered “huge” not so long ago.

 

Introducing WebAssembly

 

Thankfully, an optimized technology called WebAssembly (WASM for short) has emerged in recent years. WASM is a binary instruction format for a stack-based virtual machine. It was designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications. Although this technology was created to run on the web in a browser, it can also run in non-web environments such as Node.js. For the purposes of this article, we'll focus on web-based applications using WASM.

 

Two technologies that take advantage of WASM are DuckDB and FINOS' Perspective. Both are open source and were built with the primary goal of handling vast amounts of data with as much performance as possible.

 

DuckDB as a browser database

 

DuckDB is a lightweight, in-memory, column-oriented database optimized for speed. Its columnar design lets it skip columns a query doesn't need and apply complex operations over a single column of one data type. It is a good choice for applications that require fast data access, such as real-time analytics and visualization. As a database, it lets you run SQL queries against the data, which yields a wide variety of aggregated results. On top of that, DuckDB's default package provides a robust API to handle and analyze the inserted data.

 

This technology is also available as a library for Python, R, Java, Node.js, Julia, and C/C++ applications.
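
The browser snippets later in this article assume an AsyncDuckDB instance called db that has already been set up. Here is a minimal setup sketch based on the duckdb-wasm documentation; bundle-handling details may differ between versions:

import * as duckdb from '@duckdb/duckdb-wasm';

// Pick a WASM bundle served from jsDelivr and spin up the database worker
const bundles = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(bundles);
const workerUrl = URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker!}");`], { type: 'text/javascript' })
);
const worker = new Worker(workerUrl);

// db is the AsyncDuckDB instance used in the snippets that follow
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
URL.revokeObjectURL(workerUrl);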

 

DuckDB’s engine allows you to insert data in different formats. Some of these include:

 

  • CSV
  • Parquet
  • JSON
  • S3
  • Excel
  • SQLite
  • PostgreSQL

 

In our experience working with DuckDB in the browser, the formats we used the most are JSON row-like objects and JSON column-like objects. DuckDB also supports Apache Arrow tables right out of the box. Here are some useful insert statements:

 

import { tableFromArrays } from 'apache-arrow';

// Create a connection
const c = await db.connect();

// Create an Apache Arrow table
const arrowTable = tableFromArrays({
  id: [1, 2, 3],
  name: ['John', 'Jane', 'Jack'],
  age: [20, 21, 22],
});

// Data will be inserted as a new table named after the 'name'
// parameter. In this example, 'arrow_table' is the name of it
await c.insertArrowTable(arrowTable, { name: 'arrow_table' });

// JSON row-like format
const jsonRowContent = [
    { "col1": 1, "col2": "foo" },
    { "col1": 2, "col2": "bar" },
];

// Take the JSON object and register it as a virtual file
await db.registerFileText(
    'rows.json',
    JSON.stringify(jsonRowContent),
);

// Create a DB table based on the newly registered file
await c.insertJSONFromPath('rows.json', { name: 'rows' });

// JSON column-like format
const jsonColContent = {
    "col1": [1, 2],
    "col2": ["foo", "bar"]
};
await db.registerFileText(
    'columns.json',
    JSON.stringify(jsonColContent),
);
await c.insertJSONFromPath('columns.json', { name: 'columns' });

// Close the connection to release memory
await c.close();

 

The column-oriented storage format and WASM-backed calculations allow powerful processing right in the browser. Running operations over tens of thousands of records via DuckDB happens in a flash compared to operating on arrays of JavaScript objects, for example. Our team has taken advantage of this to perform aggregations on large data sets to reduce the burden on charting libraries. We also use it to build tabular displays with custom UI for filtering/sorting. Sorting 50K items in a JS array can take a notable amount of time. DuckDB can regenerate ordered data faster than some data table implementations, even considering marshaling from Arrow to JavaScript objects. For use-cases that want to support arbitrary querying of large datasets in the browser, especially if they are typically operating on a few columns of a wide data set, DuckDB shines.
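
As a rough illustration of the kind of work we hand off to DuckDB instead of JavaScript, an aggregation plus sort over a hypothetical our_table might look like the sketch below (c is an open connection as created earlier; the column names are placeholders):

// Aggregate and sort tens of thousands of spans in one SQL statement
const result = await c.query(`
  SELECT "serviceName", count(*) AS span_count, avg(elapsed) AS avg_elapsed
  FROM our_table
  GROUP BY "serviceName"
  ORDER BY span_count DESC
`);

// The result is an Apache Arrow table; pull a single column out of it
const counts = result.getChild('span_count')?.toJSON();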

 

We do not encourage the use of the toString or toArray built-in functions of Apache Arrow tables: with object-like columns, parsing their output resulted in more issues and patch fixes than smooth development. This became more apparent the more we wired DuckDB query results into our existing components.

 

Let's have a look at one of our backend's column object responses:

 

attributes: [
 [
  {
    "key": "my key",
    "value": "my value"
  },
  {
    "key": "another key",
    "value": "another value"
  },
 ],
 [
  {
   "key": "extra key",
   "value": "extra value"
  },
 ...
 ],
...
]

 

This type of compound object is handled by DuckDB automatically as a struct, which makes parsing and formatting the object much easier whenever we request or insert new data. The problem appears when we want to retrieve the complete column itself instead of just a set of attributes within it. Doing that and attempting to call table_results.toString and then parse it with JSON.parse() left us facing a runtime error complaining that the JSON structure was invalid. Workarounds started piling up as we tried to fix these issues, until enough was enough and we took another approach to parse Apache Arrow Table results into JSON objects.

 

A query that requests the attributes column, like the following one:

 

SELECT attributes FROM our_table

 

would return a single-column Apache Arrow Table containing the example data. Nothing is wrong so far, but the moment we attempted to parse it into a JavaScript object using table.toString, the result looked like the following snippet:

 

'[[{"key":"my\u0000key","value":"my\u0000value"},{"key":"another\u0000key","value":"another\u0000value"}],[{"key":"extra\u0000key","value":"extra\u0000value"}]]'

 

You'll notice that the whitespace inside the object was replaced by \u0000, the Unicode escape for the null character. So we had to remove this character and replace it with plain spaces. But then we had the problem of deciding when we could be sure we wanted to replace the escape with a single space rather than treat it as a literal null. On top of that, the more kinds of whitespace the object contains (tabs, new lines, regular spaces), the more replacement code we need to add.

 

With all of this happening as we grew more dependent on DuckDB, we decided to do two things: first, whenever we face struct-like objects to be inserted into the database, we transform them into stringified JavaScript objects; second, we use getChild on each column that we need in a non-Apache Arrow format and apply the required operations from there. Afterward, you can and should use the toJSON function, which transforms a single column into JavaScript objects right away.
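
The first step might look like the following sketch, where rows and its attributes field stand in for whatever struct-like data comes back from the backend:

// Stringify struct-like columns before handing the rows to DuckDB
const prepared = rows.map((row) => ({
  ...row,
  attributes: JSON.stringify(row.attributes),
}));

await db.registerFileText('prepared.json', JSON.stringify(prepared));
await c.insertJSONFromPath('prepared.json', { name: 'our_table' });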

 

Following our example and expanding it to different columns, we can use this code snippet to successfully parse any type of column coming out of DuckDB:

 

// Object to be returned in row format
const columnData: Record<string, any[]> = {};

// queryResults is the default Apache Arrow table returned by DuckDB
for (const field of queryResults.schema.fields) {
  const columnName = field.name;
  columnData[columnName] = queryResults
    .getChild(columnName)
    ?.toJSON() as any[];
}

const firstKey = Object.keys(columnData)[0];

// Aux function used to parse attributes from string to actual objects
const parseAttributes = (
  colData: Record<string, any>,
  attributeKey: string
) => {
  if (attributeKey in colData) {
    (colData[attributeKey] as any[]).forEach(
      (value: string, index: number) => {
        colData[attributeKey][index] = JSON.parse(value);
      }
    );
  }
};
parseAttributes(columnData, "attributes");

// Transform column-based results into row-based results
return columnData[firstKey].map((_, index) => {
  const row: Record<string, any> = {};
  Object.keys(columnData).forEach((key) => {
    row[key] = columnData[key][index];
  });
  return row;
});

 

This approach has the benefit of operating over a single column with multiple values (i.e., in columnar format) and lets us decide which columns should be parsed to row format without the extra overhead of the others, while requiring zero code patches in the parsing process.

 

For more information on the complete set of DuckDB capabilities, feel free to check the official documentation.

 

FINOS’ Perspective: A More Complete Solution

 

The Fintech Open Source Foundation (FINOS from now on) is a community that specializes in creating open-source solutions for financial services. One of these is Perspective, which is an interactive analytics and data visualization component. It is especially well-suited for large and/or streaming datasets. It offers a wider range of features than DuckDB, including support for user-configurable reports, dashboards, notebooks, and applications. 

 

Just like DuckDB, this technology is compiled to WASM, which allows it to be used in the browser, but it is also compiled for Python, so it can be used alongside Python code and/or JupyterLab. On top of that, you can power Perspective over a WebSocket from either Python or Node.js.

 

In order to work with Perspective you'll have to understand its two main elements: tables and views. On the browser side, you request a table from a Web Worker. A table is an interface over a single dataset, used to stream data into Perspective. A view is a continuous query of a table. You could say that a table holds the data and connects to Perspective, while a view is the current snapshot of the table, including filters, group-by statements, ordering, visible columns, and much more. A single table can have multiple views, which lets us rapidly switch between predefined filter selections that have value to the end user.
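
In code, that relationship looks roughly like this sketch using the @finos/perspective browser API (the sample rows and view configuration are placeholders):

import perspective from '@finos/perspective';

// Request a Web Worker that will host the Perspective engine
const worker = await perspective.worker();

// A table is an interface over a single dataset
const table = await worker.table([
  { serviceName: 'checkout', elapsed: 12 },
  { serviceName: 'checkout', elapsed: 48 },
]);

// A view is a continuous query of that table
const view = await table.view({
  group_by: ['serviceName'],
  columns: ['elapsed'],
});

const rows = await view.to_json();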

 

Here’s an example of a Table’s view, where each key represents a modifiable field in the Perspective table:

 

    tableConfiguration: {
      aggregates: {},
      group_by: [],
      columns: [],
      expressions: [],
      filter: [],
      plugin: "Datagrid",
      plugin_config: {
        scroll_lock: false,
        editable: false,
        columns: {},
      },
      split_by: [],
      settings: true,
      sort: [],
    },

 

Although Perspective offers a broader selection of features than DuckDB, it also restricts us in other ways. For example, we need to define a schema before creating a table, and once the table is created, that schema is immutable.

 

A table schema is defined as an object where every column to be shown is mapped to its data type. This doesn't include expression columns, as those are created by the user and can vary in data type depending on the operations applied to the data.

 

    tableSchema: {
      elapsed: "integer",
      ends: "datetime",
      id: "string",
      label: "string",
      language: "string",
      libraryName: "string",
      libraryVersion: "string",
      name: "string",
      serviceInstanceId: "string",
      serviceName: "string",
      serviceNamespace: "string",
      serviceVersion: "string",
      starts: "datetime",
      statusCode: "integer",
      statusMessage: "string",
    },

 

In our flow, Perspective tables are updated with plain JavaScript objects; however, if we want to copy the table data from one table to another, we can export the first table's data to Arrow format and send that as the input data for the second table.
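
For instance, creating a table from the schema above and feeding it plain JavaScript objects could look like the following sketch (worker is the Perspective worker from earlier; the record values are placeholders):

const table = await worker.table(tableSchema);

// Updates accept arrays of plain JavaScript objects matching the schema
await table.update([
  {
    id: 'span-1',
    name: 'GET /users',
    serviceName: 'checkout',
    elapsed: 12,
    statusCode: 0,
    starts: new Date(),
    ends: new Date(),
  },
]);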

 

This leads us to another challenge in our use of Perspective: even though we can export the current table data to JSON, columnar, CSV, or Apache Arrow format, this process tends to lag the page as the number of columns and rows in the table grows. The reason is that, with a single worker for the Perspective table, that worker is in charge of ingesting new data, driving the UI configuration, and exporting its data to other components. These operations are not done in parallel, so the table freezes while an export is in progress. If you plan on using Perspective as your main data source or database, take into consideration that linking its data with other components will most likely affect the performance of the view.

 

A fix that Perspective's authors suggested to us was to create a second worker that handles exporting the data from Arrow format to JSON, keeping the initial worker responsible for UI rendering and holding the main source of data. The code to handle this looked something like this:

 

const originalSchema = await firstTable.schema()
const currentView = await firstTable.view()

// Exporting to Arrow instead of JSON is faster
const arrowData = await currentView.to_arrow()

// We have now copied the UI table's data and view into a secondary
// worker and table
const secondTable = await secondWorker.table(originalSchema)
await secondTable.update(arrowData)
const secondView = await secondTable.view(await currentView.get_config())

// Return data in any format needed
return await secondView.to_json()

 

Does this mean DuckDB is a better choice than Perspective? It depends. The features Perspective provides out of the box really jump-started the development of our application, but we came to realize that customizing it to fit our desired UI was difficult and felt like fighting its strengths instead of playing to them. Programmatically applying filters to the table data, while possible, required a DSL and Perspective APIs that were ultimately more work to keep track of than the SQL we could construct and run against DuckDB. Each technology has its benefits and disadvantages; the goal is to understand them and use each for its best use cases.
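
To give a feel for the difference, the same programmatic filter expressed both ways might look like this sketch (table is a Perspective table, c a DuckDB connection, and the column names are hypothetical; Perspective filters are column/operator/value triples, while DuckDB takes plain SQL):

// Perspective: re-create the view with a filter clause
const filteredView = await table.view({
  filter: [
    ['serviceName', '==', 'checkout'],
    ['elapsed', '>', 1000],
  ],
});

// DuckDB: the equivalent is just SQL we already know how to compose
const filtered = await c.query(`
  SELECT * FROM our_table
  WHERE "serviceName" = 'checkout' AND elapsed > 1000
`);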

 

Combining the best of both worlds

 

Given that each technology offers its own advantages and disadvantages, we decided that instead of dumping or patching all of our existing Perspective code, we would transition from a Perspective-centric storage philosophy into a DuckDB-centered one, which can then export data to any other component, including Perspective. To achieve this, we kept the existing Perspective table schema for our components. Then, whenever we received data from the backend, we ran a series of steps that transformed the data from DuckDB's format into one that Perspective handles. This effectively let us continue using all of Perspective's features without worrying about Perspective's current view exporting inconsistent data to other components that may need a disabled column.

 

In addition, by making DuckDB the main data source, we could create multiple parallel connections to the database, which translates into multiple components accessing the data they need to display insights at the same time. Another improvement from switching sources is that we can now keep all of the extra filters in a single shareable place and apply them to SQL queries, instead of having them directly linked to Perspective view filters. We could have written a function that transforms those view filters into their SQL equivalent, but that would not remove the overhead caused when exporting data.
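
Here is a sketch of what that can look like in practice (the connection names and shared filter list are hypothetical):

// Each component opens its own connection against the same DuckDB instance
const chartConnection = await db.connect();
const gridConnection = await db.connect();

// Shared filters live in one place and compile to SQL
const sharedFilters = [`"serviceName" = 'checkout'`, `elapsed > 1000`];
const whereClause = sharedFilters.length
  ? `WHERE ${sharedFilters.join(' AND ')}`
  : '';

// Both components query in parallel with the same filters applied
const [chartData, gridData] = await Promise.all([
  chartConnection.query(
    `SELECT "serviceName", avg(elapsed) AS avg_elapsed FROM our_table ${whereClause} GROUP BY 1`
  ),
  gridConnection.query(`SELECT * FROM our_table ${whereClause}`),
]);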

 

The flow for using DuckDB and Perspective together follows this structure:

 

// Realize new data has been inserted into DuckDB ...

const perspectiveSchema = getPerspectiveSchema()
const queryWithPerspectiveColumns = baseQuery.addColumns(perspectiveSchema.columns)

// Data comes back as Apache Arrow, so we must parse it into a Perspective-valid format
const queryResults = await duckDB.query(queryWithPerspectiveColumns)

const parsedResults = duckDBResultsToPerspectiveInput(queryResults)
await perspective.updateTableData(parsedResults)

// Perspective is now updated with the newest DuckDB data...

 

Another advantage of switching to DuckDB is that we can now design as many complex, composable queries as we like in order to derive insight values from the data. Aggregations within these queries operate on the Apache Arrow data format, which is much faster than regular row-based aggregation in JavaScript in the browser.
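
For example, a base query can be layered under an insight query with a common table expression, along these lines (table and column names are again placeholders, and c is an open connection):

// A reusable base query; shared filters could be injected here as shown earlier
const baseQuery = `SELECT "serviceName", elapsed FROM our_table WHERE elapsed > 0`;

// Compose an insight on top of it as a common table expression
const slowestServices = await c.query(`
  WITH filtered AS (${baseQuery})
  SELECT "serviceName", max(elapsed) AS slowest_span
  FROM filtered
  GROUP BY "serviceName"
  ORDER BY slowest_span DESC
  LIMIT 5
`);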

 

Obstacles Overcome

 

We would like to share a list of trial-and-error situations we faced while developing with these technologies, along with the solution we reached for each:

 

  • Exporting data from Perspective to another component must include the columns the second component requires, because the Table's View configuration doesn't include any column not present in the view, and it changes the exported data if any aggregation is applied. For instance, if we deselect the elapsed column and try to export data to the Insights section (old Insights), the component will not be updated, since exporting data from Perspective with the current view doesn't include that column.
    • We validated that the columns are present in the current view; otherwise, we freeze the component's status until they are available again.
  • Although we can export data in Arrow format from Perspective, most of our already-developed components did not accept that data type. This means that exporting data from Perspective to be used in another component with a data set of over 50K records will lag the UI experience.
    • We copied the data from the first table into a second table. This data is passed in Arrow format, so it's faster than JSON. We keep the first table as the UI-related component, while the second one is in charge of transforming the data to JSON for the rest of the components. This keeps the rendering responsibility with the first worker and the exporting work with the second, reducing the workload and improving performance overall.
  • DuckDB offers benefits with regard to exporting data; however, because we need to format the incoming data from the backend to match the current DB schema, we cannot load data as fast as possible (using multiple Promises, for example).
    • We decided to use a single thread that requests the largest number of records that stays under the data-transfer limit (see the sketch after this list). That way, we can control loading as usual and keep the DB schema updated at all times without unexpected errors.
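
A rough sketch of that sequential, batched loading, assuming a hypothetical fetchTracesPage helper, a tuned BATCH_SIZE, and a staging table for appends:

const BATCH_SIZE = 5000; // assumption: sized to stay under the transfer limit
let offset = 0;

for (;;) {
  // Hypothetical backend call that returns one page of records
  const batch = await fetchTracesPage(offset, BATCH_SIZE);
  if (batch.length === 0) break;

  const fileName = `batch_${offset}.json`;
  await db.registerFileText(fileName, JSON.stringify(batch));

  // Stage the batch, then append it so the main table's schema stays consistent
  await c.insertJSONFromPath(fileName, { name: 'staging_batch' });
  await c.query('INSERT INTO our_table SELECT * FROM staging_batch');
  await c.query('DROP TABLE staging_batch');

  offset += batch.length;
}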

 

Key Takeaways

 

  • If you plan on only using Perspective features to display data and you don't want to burden the user with its Datagrid controls, go ahead and create multiple views ready to be used, for example as dropdown options.
  • If you don't need to export the data once it reaches the main source, Perspective works just as well as DuckDB does.
  • If you need to share the data with multiple consumers and apply equally complex operations, DuckDB is the better option.
  • If you have a power user who is willing to learn how to use Perspective, how to create new columns, design different views, and extract as many insights from it as possible, pick Perspective and write a good tutorial for it.

 

Conclusion

 

DuckDB and Perspective are both powerful technologies that can be used to run queries on data stored in the browser. The best choice for you will depend on your specific needs and requirements.

 

We recommend taking your time with each technology, reading the docs and playing around with them, and choosing whichever feels more comfortable and fulfills the needs of your web app. Both are constantly being updated, and the communities surrounding them are incredibly welcoming and patient when it comes to questions and doubts.

Our TelemetryHub site has a Perspective table demo attached to the Docs. You can go ahead and play around with it to get a feel for how the table works. We also use both DuckDB and Perspective to present the Traces data in order to get the most out of these technologies.