Saturday, May 30, 2020

Reading Excel Data in Node.JS

Building a web app that reads Excel data


This is a real-world project that has many applications. When I set out to solve it, there was not much in the way of tutorials or information on how to handle the many complex issues that surround a real-world implementation.

Reading the file through file upload

In my application I needed to read a file that a user would upload. My Node.JS application was a server-side Express API, so I needed to build a route that would accept a file upload. I chose to keep the data in memory rather than copy the file to a file server, as I was only interested in the data, not the actual Excel file itself. Once I had read all the data from the file, I would release it from memory.

Expectations for upload and response

I had a front-end single-page application (built on Vue.JS) that used Dropzone.JS for file upload handling, with Axios posting the file to the Express.JS Node.JS API. There were a few user-experience issues to overcome. The upload itself was well covered by the Dropzone API, which already handled file validation and upload progress, so the user could see where the process was during the upload and when it was complete. However, that only covered the upload. I also needed to read the uploaded file, perform some functions on the data, and store the information in a database. Dropzone.JS could not track all of those processes and update the UI for the user, so I needed to build a custom UI that would track them in real time for a good user experience.

I decided to use vue-progressbar to create a progress bar element so that the user could see the progress of their job in real time.

User Experience during long processing - Front End

One issue I ran into was that the time needed to complete the complex calculations on the data file could exceed the maximum POST response timeout, depending on how large the file was. The purpose of this application was to handle large data sets (40k+ rows), so I decided to create a new route to track the progress of the file processing. This particular web application did not have WebSockets installed and I didn't want to set all that up, so I created a simple polling loop that started after the upload and stopped after the process was complete. There were some issues with this approach, but I think it was still a good solution for the time being.
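The polling loop described above can be sketched as a small helper. The `getProgress` function is whatever fetches `{ percent }` from the progress route — `axios.get` in the real app — and is passed in here so the loop itself stays independent of the HTTP client. The names and shape are illustrative, not the original code.

```javascript
// Poll a progress route on an interval until the server reports the job
// is finished, then stop. Rejects (and stops) on the first fetch error.
function pollProgress(getProgress, onUpdate, intervalMs = 1000) {
  return new Promise((resolve, reject) => {
    const timer = setInterval(async () => {
      try {
        const { percent } = await getProgress();
        onUpdate(percent); // e.g. drive vue-progressbar from here
        if (percent >= 100) {
          clearInterval(timer); // processing complete: stop polling
          resolve();
        }
      } catch (err) {
        clearInterval(timer);
        reject(err);
      }
    }, intervalMs);
  });
}
```

In the app, `pollProgress` would be started right after Dropzone reports the upload finished, with `getProgress` hitting the new progress route.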

The first issue was tracking state. I wanted the UI to be able to keep updating even if the user went to a different page in the application. It would be bad UX to expect a user to sit on the page watching the progress bar for the whole duration of a long process, so I needed to add some state management objects and methods in Vuex.
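A Vuex store module for this might look like the sketch below (names are illustrative, not the original code): because the progress lives in the store rather than in the upload page's component, any component can render it even after the user navigates elsewhere.

```javascript
// Vuex module sketch: the polling loop commits SET_PROGRESS on each tick,
// and any component can read `percent`/`running` from the store.
const uploadModule = {
  state: () => ({ percent: 0, running: false }),
  mutations: {
    START_PROCESS(state) {
      state.running = true;
      state.percent = 0;
    },
    SET_PROGRESS(state, percent) {
      state.percent = percent;
      if (percent >= 100) state.running = false; // job finished
    },
  },
};

module.exports = uploadModule;
```

The upload page commits `START_PROCESS`, the polling callback commits `SET_PROGRESS`, and a small global indicator component can watch `running` to show the bar anywhere in the app.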

With the state management now tracking and the application polling a route that updates the progress of the processing we had a pretty good user experience that allowed the user to start a long task and continue working on other tasks, while still being able to check on the progress of the long task.

Reading the blob Excel file - Back End

The library I chose to read the Excel file was excel.js. Luckily, it had a method for reading blob data as well as reading from a file location. So I created a route in Express that handled the file upload and called a method to read the file. In my solution, I created a class that would be instantiated on application start in order to track updates between routes, since Express does not have an out-of-the-box mechanism for communication between routes. The tricky part was that once the file was uploaded and the calculation process had started, I could update an object, but reading that object in real time from a separate route was a challenge. I could have used a global variable, but I don't like that approach; I prefer an object I can import, instead of polluting the entire project with variables that are difficult to keep track of.
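Assuming the library is the `exceljs` npm package, loading the uploaded bytes looks roughly like this — `workbook.xlsx.load` accepts a Buffer, so no temporary file is needed. The `require` is kept inside the function purely so this sketch loads even where the package is not installed.

```javascript
// Read an Excel workbook straight from the in-memory upload Buffer.
// Assumes the `exceljs` npm package; the function name is illustrative.
async function readWorkbook(buffer) {
  const ExcelJS = require('exceljs');
  const workbook = new ExcelJS.Workbook();
  await workbook.xlsx.load(buffer); // parse directly from memory, no temp file
  return workbook;
}
```

The upload route can pass its Buffer here and then iterate the resulting workbook's sheets and rows.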

I used a class that had methods to start, update, and clear a progress property that I could read anywhere in my application. The obvious drawback to this approach was that I could only track a single process at a time, because the class was instantiated once on application start. If I tried to run two processes at the same time, they would both update the same property on the same object. For this particular application, only one user would be using this feature, so I decided not to spend the time building a queue system to handle multiple file processing jobs. This reduced my development time considerably and allowed me to deliver sooner, as my client was not concerned about the missing functionality.

Forking the process to avoid blocking the event loop

The back-end Express API was also used by many other applications, so I needed to make sure that these routes and functions did not block the Node.JS event loop and cause poor performance. For this reason I created a forked process that could read the Excel file, perform the calculations, and report its progress back to the API using the process messaging methods. This approach created a few more challenges, as a forked process has some unique caveats. The forked process was essentially a copy of the Express API application running as a separate process; however, since it was started independently of the original process, it could not access any of the variables or objects the original process had created during its life cycle. This meant that the database connection pool objects could not be accessed in the new process, nor could I update the progress property of my custom class directly.

I needed to create a new database connection pool inside the forked process, and use the messaging system to tell the parent process what was going on.

Looping through the sheets of a workbook

The Excel file I was reading had three sheets with data relevant to my application. I used the methods from the excel.js library to loop through the sheets and then through the rows. Before looping through the rows of a sheet, I needed to validate the header row to make sure the user hadn't uploaded a file that was incompatible with my solution.
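Assuming the `exceljs` package, the loop with header validation might look like this. `EXPECTED_HEADERS` and the function names are illustrative; one real quirk worth noting is that `row.values` is 1-based, so index 0 is empty.

```javascript
// Validate the header row before processing, so an incompatible upload
// fails fast instead of producing garbage rows. Headers are hypothetical.
const EXPECTED_HEADERS = ['Id', 'Name', 'Amount'];

function validateHeader(headerValues) {
  // exceljs row.values is 1-based (index 0 is empty), so normalize first
  const cells = headerValues.filter((v) => v !== undefined && v !== null);
  return EXPECTED_HEADERS.every((h, i) => String(cells[i]).trim() === h);
}

function processWorkbook(workbook, handleRow) {
  workbook.eachSheet((sheet) => {
    if (!validateHeader(sheet.getRow(1).values)) {
      throw new Error(`Sheet "${sheet.name}" has an unexpected header row`);
    }
    sheet.eachRow((row, rowNumber) => {
      if (rowNumber === 1) return; // skip the header itself
      handleRow(row.values, sheet.name);
    });
  });
}
```

Throwing on a bad header lets the route catch the error and return a clear "wrong file format" message before any calculations start.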