Features:
Eleven Awesome Things You Can Do with csvkit
A Listicle Introduction to the Swiss Army Knife of CSVs
Back in 2009, Aaron Bycoffe inspired Joe Germuska to write a thirty-line Python script called csvcut
for making it easy to select a subset of columns from a CSV file. For the next two years the two of them went back and forth forking and making small revisions to this tool. We got a lot of mileage out of it at the Chicago Tribune, where Joe and I were constantly wrestling with the annoyances of processing datasets efficiently in the command line. In 2011, in a fit of exuberance, I rewrote csvcut
and began building other tools for making this sort of work easier. Thus csvkit was born and has been under continuous development ever since.
Over the last several months there have been two major releases of csvkit. These releases have brought long-awaited features such as Python 3 support, a csvformat
utility and a new csvkit tutorial—not to mention a slew of bug fixes. To celebrate the latest release, here are eleven of my favorite awesome things you can do with csvkit. If you aren’t using it yet, hopefully this will convince you.
1. Ditch Excel (for real)
Who needs it?
in2csv file1.xls > file1.csv
in2csv file2.xlsx > file2.csv
Reference: in2csv
2. Conquer fixed-width formats
Fixed-width files are a particularly frustrating to parse. Save yourself some frustration by using a CSV-formatted schema to convert your fixed-width file into a CSV!
[schema.csv]
column,start,length
name,1,5
age,6,2
cash,8,3
[data.fixed]
Chris44 72
Brian26110
Ryan 18145
Joe 34 83
in2csv -f fixed -s schema.csv data.fixed > data.csv
[data.csv]
name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83
Be sure to check out FFS for a database of fixed-width schemas for common files, such as those produced by the Bureau of Labor Statistics.
Reference: in2csv
3. Find cells matching a regular expression
[data.csv]
name,phone_number
Chris,555-999-1111
Brian,555-123-4567
Ryan,555-123-8901
Joe,555-777-1111
Find phone numbers following the pattern “ddd–123-dddd”:
csvgrep -c phone_number -r "\d{3}-123-\d{4}" data.csv > matching.csv
[matching.csv]
name,phone_number
Brian,555-123-4567
Ryan,555-123-8901
Reference: csvgrep
4. Turn your data into a JSON lookup table
Do you have data that with a unique id column? Would you like to be able to load that data into your browser keyed by its unique id so that you can use it as a lookup table? Well then.
[geo.csv]
slug,place,latitude,longitude
dcl,Downtown Coffee Lounge,32.35066,-95.30181
tyler-museum,Tyler Museum of Art,32.33396,-95.28174
genecov,Genecov Sculpture,32.299076986939205,-95.31571447849274
csvjson --key slug --indent 4 geo.csv
[keyed.json]
{
"dcl": {
"slug": "dcl",
"place": "Downtown Coffee Lounge",
"latitude": "32.35066",
"longitude": "-95.30181"
},
"tyler-museum": {
"slug": "tyler-museum",
"place": "Tyler Museum of Art",
"latitude": "32.33396",
"longitude": "-95.28174"
},
"genecov": {
"slug": "genecov",
"place": "Genecov Sculpture",
"latitude": "32.299076986939205",
"longitude": "-95.31571447849274"
}
}%
Reference: csvjson
5. Turn a CSV with latitude and longitude columns into GeoJSON
Is your data geographic, like in our previous example? Then don’t stop with JSON—go one step further and make it GeoJSON!
csvjson --lat latitude --lon longitude --key slug --crs EPSG:4269 --indent 4 geo.csv > geo.json
[geo.json]
{
"type": "FeatureCollection",
"bbox": [
-95.31571447849274,
32.299076986939205,
-95.28174,
32.35066
],
"features": [
{
"type": "Feature",
"id": "dcl",
"geometry": {
"type": "Point",
"coordinates": [
-95.30181,
32.35066
]
},
"properties": {
"place": "Downtown Coffee Lounge"
}
},
{
"type": "Feature",
"id": "tyler-museum",
"geometry": {
"type": "Point",
"coordinates": [
-95.28174,
...
Reference: csvjson
6. Generate summary statistics for any CSV file
[data.csv]
name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83
Ryan,27,300
Jeremy,98,1
$ csvstat data.csv
1. name
<type 'unicode'>
Nulls: False
Values: Chris, Brian, Jeremy, Joe, Ryan
2. age
<type 'int'>
Nulls: False
Min: 18
Max: 98
Sum: 247
Mean: 41.1666666667
Median: 30.5
Standard Deviation: 26.6359197743
Unique values: 6
3. cash
<type 'int'>
Nulls: False
Min: 1
Max: 300
Sum: 711
Mean: 118.5
Median: 96.5
Standard Deviation: 92.1461701139
Unique values: 6
Row count: 6
csvstat
will calculate different statistics based on the type of each column.
Reference: csvstat
7. Execute a SQL query directly on a CSV file
[data.csv]
name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83
Ryan,27,300
Jeremy,98,1
csvsql --query "select name from data where age > 30" data.csv > old.csv
[old.csv]
name
Chris
Joe
Jeremy
Care to take it up a notch? Reference two CSV files and use SQL to JOIN them (this example uses data from the csvkit repository:
csvsql --query "select m.usda_id, avg(i.sepal_length) as mean_sepal_length from iris as i join irismeta as m on (i.species = m.species) group by m.species" examples/iris.csv examples/irismeta.csv
usda_id,mean_sepal_length
IRSE,5.006
IRVE2,5.936
IRVI,6.588
(Second example courtesy Jeroen Janssens.)
Reference: csvsql
8. Automatically create a SQL table and import a CSV into a database
[data.csv]
name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83
Ryan,27,300
Jeremy,98,1
createdb demographics
csvsql --db postgresql:///demographics --insert data.csv
Too good to be true? Check it out:
psql -q demographics -c "\d data"
Table "public.data"
Column | Type | Modifiers
--------+----------------------+-----------
name | character varying(6) | not null
age | integer | not null
cash | integer | not null
psql -q demographics -c "select * from data"
name | age | cash
--------+-----+------
Chris | 44 | 72
Brian | 26 | 110
Ryan | 18 | 145
Joe | 34 | 83
Ryan | 27 | 300
Jeremy | 98 | 1
(6 rows)
Reference: csvsql
9. Extract a table from a SQL database into a CSV
sql2csv --db postgresql:///demographics --query "select * from data" > extract.csv
[extract.csv]
name,age,cash
Chris,44,72
Brian,26,110
Ryan,18,145
Joe,34,83
Ryan,27,300
Jeremy,98,1
Reference: sql2csv
10. Turn your Github issues into a CSV
curl https://api.github.com/repos/onyxfish/csvkit/issues?state=open | in2csv -f json > issues.csv
[issues.csv]
url,labels_url,comments_url,events_url,html_url,id,number,title,user,labels,state,locked,assignee,milestone,comments,created_at,updated_at,closed_at,pull_request,body
https://api.github.com/repos/onyxfish/csvkit/issues/329,https://api.github.com/repos/onyxfish/csvkit/issues/329/labels{/name},https://api.github.com/repos/onyxfish/csvkit/issues/329/comments,https://api.github.com/repos/onyxfish/csvkit/issues/329/events,https://github.com/onyxfish/csvkit/pull/329,42344699,329,Adding support for multiline json documents,,,open,False,,,0,2014-09-09T20:03:02Z,2014-09-09T20:03:02Z,,,Fixes #275
https://api.github.com/repos/onyxfish/csvkit/issues/328,https://api.github.com/repos/onyxfish/csvkit/issues/328/labels{/name},https://api.github.com/repos/onyxfish/csvkit/issues/328/comments,https://api.github.com/repos/onyxfish/csvkit/issues/328/events,https://github.com/onyxfish/csvkit/issues/328,42259481,328,Docs should lead-off with an example of why it's awesome,,,open,False,,,0,2014-09-09T00:36:58Z,2014-09-09T00:36:58Z,,,
...
Reference: in2csv
11. Slice a 9,655 column file using index ranges
Perhaps the most audacious use of csvkit I’ve seen:
Slicing 9,655 Illinois schools columns. All praises to csvkit. pic.twitter.com/4DCD2VgYSC
— Dan Hill (@DanHillReports) May 28, 2013
(Credit to Dan Hill.)
Reference: csvcut
What awesome thing are you going to do with csvkit?
I like to think of csvkit as a “proper” open source project, which is to say, it is what its users have made of it. These examples didn’t come straight out of my brain—they are the product of use, suggestions and feedback from hundreds of users. So in closing, I’d like to say thanks to the 46 other people who have so far contributed code to csvkit. The next planned release of csvkit will be the fabled 1.0
. I hope you will all join me by filing or fixing a bug. If you’ve never contributed to an open source project before, this is a great space to start. Join us in bringing some sanity to the world of command-line data processing.
People
Credits
-
Christopher Groskopf
Christopher Groskopf is a data journalist on the NPR Visuals team. In November he’ll be joining Quartz. He divides his time between Tyler, TX and Washington, DC. Find him on Twitter: @onyxfish.