Question: What are ways to make dense CSV data more readable?

warren is asking a question about pm

warren asked on October 03, 2019 18:40


In follow-up to the question @stevie posted about cleaning datasets, I had a related one: when I tried to post a CSV file (essentially a spreadsheet), I found it had too many columns to be very readable. When graphed, there was just too much information.

I tried deleting a few columns but it wasn't ideal... What are common ways to deal with this?

Referencing https://publiclab.org/questions/stevie/08-23-2019/what-are-best-practices-and-tools-to-help-clean-up-data-sets

[Image: plots of the many CSV columns from the PurpleAir log, graphed together]



6 Comments

Hi @stevie, @Aleah, @Cbarnes9, and @crispinpierce, I posted this in follow-up after the PurpleAir datalogs ended up so dense and tough to read! Thanks!



This is very much dataset-, project-, and software-dependent.

Sometimes I remove columns, other times I combine them (averaging, adding, etc.), and other times I split the dataset into "themes" like "all PM data", "all flow data", or "all temperature data".

Sometimes I use a spreadsheet tool (LibreOffice), but most of the time I work with a scripting language (R, Python, Octave) where I can easily select which parts of the dataset I want to look at (see the sketch below).
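For example, in Python with pandas that kind of column surgery only takes a few lines. This is just a sketch: the file name and the "pm2_5_ch_a" / "pm2_5_ch_b" channel columns are made up for illustration, so swap in whatever your header row actually says:

```python
import pandas as pd

# Hypothetical file and column names, loosely modeled on a PurpleAir log
df = pd.read_csv("purpleair_log.csv")

# Remove columns that never change across the whole file
df = df.drop(columns=["firmware", "macaddress", "hardware"], errors="ignore")

# Combine two redundant channels by averaging them
df["pm2_5_avg"] = df[["pm2_5_ch_a", "pm2_5_ch_b"]].mean(axis=1)

# Split out a "theme": the timestamp plus every PM-related column
pm_theme = df[["UTCDateTime"] + [c for c in df.columns if c.startswith("pm")]]
pm_theme.to_csv("pm_only.csv", index=False)
```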

What's "universal" is that you need to know (and understand) what each column means and have access to any metadata available so that you can judge which columns are relevant to your question and what might explain certain weirdness in the data.

From the looks of those plots, there seem to be several columns that have the same value for most of the dataset (firmware version, macaddress, hardware), while others are "time axes" like "UTCDateTime" and "uptime". My recommendation would be to first write down a few very specific questions you want to answer (what's the average PM10 concentration? what's the range of temperature observed? etc.) and only look at the parts of the data that help you answer them. I've always found that when I start with a specific question, the data often lead me to the next (more complex) question, and I get more proficient in dealing with that particular dataset.
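To make that question-first approach concrete, here's a minimal pandas sketch. The column names ("pm10_0", "temperature") are assumptions, so check them against your own header row:

```python
import pandas as pd

# Hypothetical file and column names; adjust to your actual log
df = pd.read_csv("purpleair_log.csv", parse_dates=["UTCDateTime"])

# Q1: what's the average PM10 concentration?
print("Mean PM10:", df["pm10_0"].mean())

# Q2: what's the range of temperature observed?
print("Temperature range:", df["temperature"].min(), "to", df["temperature"].max())
```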

Good luck!


Thanks! Yes, I started by removing things like the firmware number, for sure. It's interesting to see how some values track each other roughly, while others don't. But I agree it makes sense to start with a question! Thanks @guolivar!


I also noticed that some very high values make others basically unreadable. So having two y-axis scales can sometimes help, I guess, although that's not so simple with some graphing software (like Google Sheets).
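For anyone doing this with a script instead of a spreadsheet, a second y-axis is straightforward in matplotlib via twinx(). A rough sketch, with placeholder file and column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file and column names
df = pd.read_csv("purpleair_log.csv", parse_dates=["UTCDateTime"])

fig, ax_pm = plt.subplots()
ax_pm.plot(df["UTCDateTime"], df["pm2_5"], color="tab:blue")
ax_pm.set_ylabel("PM2.5 (µg/m³)", color="tab:blue")

# twinx() adds a second y-axis sharing the same x-axis,
# so a large-valued column doesn't flatten a small-valued one
ax_pressure = ax_pm.twinx()
ax_pressure.plot(df["UTCDateTime"], df["pressure"], color="tab:red")
ax_pressure.set_ylabel("Pressure (hPa)", color="tab:red")

fig.autofmt_xdate()
plt.show()
```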


Normalising is your friend when you have things with very different ranges. Depending on what you're looking for, you can divide each column by its maximum or its mean, or subtract the mean from the values, or "properly" normalise by making the mean 0 and the standard deviation 1. Here is a short(ish) page about normalising with Python, but the formulas used are also available in all spreadsheet tools: https://machinelearningmastery.com/normalize-standardize-time-series-data-python/
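Each of those options is a one-liner in pandas. A quick sketch (file name and column handling assumed):

```python
import pandas as pd

df = pd.read_csv("purpleair_log.csv", parse_dates=["UTCDateTime"])
numeric = df.select_dtypes("number")  # only normalise the numeric columns

by_max = numeric / numeric.max()      # divide each column by its maximum
by_mean = numeric / numeric.mean()    # ... or by its mean
centered = numeric - numeric.mean()   # ... or subtract the mean
standardized = (numeric - numeric.mean()) / numeric.std()  # mean 0, std dev 1
```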




Pivot tables? Lookup or index commands? Conditional formatting? All of these can help. I once had 10 years of data from 14 plant plot growth sites, where each record was one of 10,000 plants, all in one CSV. I colored a lot of columns. (There's a pivot-table sketch below.)
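For a scripted version of the pivot-table idea, here's a pandas sketch. The layout and column names ("site", "year", "growth_cm") are made up to mirror that plant dataset:

```python
import pandas as pd

# Hypothetical layout: one row per plant per year
df = pd.read_csv("plant_growth.csv")

# Collapse thousands of rows into a small site-by-year grid of average growth
summary = df.pivot_table(values="growth_cm", index="site",
                         columns="year", aggfunc="mean")
print(summary)
```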



