[LRUG] Reading a "large" spreadsheet

Andrew Stewart boss at airbladesoftware.com
Fri Mar 15 06:40:37 PDT 2013


Hello El Rug,

In the past I have successfully used roo [1] to read xlsx spreadsheets.  However I have run into a stumbling block with the latest spreadsheet I need to read – it takes tens of minutes or more, rather than a few seconds.

    require 'rubygems'
    require 'roo'
    s = Roo::Excelx.new 'spreadsheet.xlsx'

When I open the spreadsheet in Numbers I can see there are ~900 rows and ~40 columns in the first worksheet (the other two worksheets have next to nothing).  On disc that worksheet is 53MB of XML.  roo uses Nokogiri to read the XML and sure enough when I try to open the worksheet XML file directly with Nokogiri it takes ~1.5min – still a long time.

    require 'rubygems'
    require 'nokogiri'
    doc = Nokogiri::XML File.open('spreadsheet.xlsx')

The one time I did manage to read the spreadsheet with roo, it told me there were ~900,000 rows.  Presumably 899,000 are empty.  And presumably this is the problem.

Unfortunately I cannot make the spreadsheet available because it's confidential to a customer.

How can I read the 900 actual rows in a few seconds?  (I'm going to ask my customer if they can somehow save their Excel file differently but in the meantime...)  I'd prefer a pure-Ruby solution but at the end of the day I just need to get it read so I don't mind calling out to something else and getting the results back as, say, CSV.

Any suggestions?

Many thanks in advance!

Cheers,
Andy Stewart

[1] https://github.com/Empact/roo




More information about the Chat mailing list