/

/

CSV Reader Benchmarks: Julia Reads CSVs 10-20x Faster than Python and R

/

/

CSV Reader Benchmarks: Julia Reads CSVs 10-20x Faster than Python and R

CSV Reader Benchmarks: Julia Reads CSVs 10-20x Faster than Python and R

CSV Reader Benchmarks: Julia Reads CSVs 10-20x Faster than Python and R

Date Published

Jun 22, 2020

Jun 22, 2020

Contributors

Share

Share

Date Published

Jun 22, 2020

Contributors

Share

The very first task in any data analysis workflow is simply reading the data, and this absolutely must be done quickly and efficiently so the more interesting work can begin. Across many industries and domains, the CSV file format is king for storing and sharing tabular data. Loading CSVs fast and robustly is crucial, and it must scale well across a wide variety of file sizes, data types, and shapes. This post compares the performance for reading 8 different real-world datasets across three different CSV parsers: R’s fread, Pandas’ read_csv, and Julia’s CSV.jl. Each of these was chosen as the “best in class” CSV parser in each R, Python and Julia, respectively.

All three tools have robust support for loading a wide variety of data types with potentially missing values, but only fread (R) and CSV.jl (Julia) support multithreading—Pandas only supports single threaded CSV loading. Julia’s CSV.jl is further unique in that it is the only tool that is fully implemented in its higher-level language rather than being implemented in C and wrapped from R / Python. (Pandas does have a slightly more capable Python-native parser, it is significantly slower and nearly all uses of read_csv default to the C engine.) As such, the CSV.jl benchmarks here not only represent the speed of loading data in Julia, but are also indicative of the sorts of performance that’s possible in the subsequent Julia code used in the analysis.

The following benchmarks show that Julia’s CSV.jl is 1.5 to 5 times faster than Pandas even on a single core; with multithreading enabled, it is as fast or faster than R’s read_csv. The tools used for benchmarking were BenchmarkTools.jl for Julia, microbenchmark for R, and timeit for Python.

Homogenous data

Let’s start with some homogeneous datasets i.e. datasets which have the same kind of data in all columns. The datasets in this section, apart from stock price dataset, are derived from this benchmark site. The performance metric is the time taken to load a dataset as the number of threads is increased from 1 to 20. Since Pandas does not support multi-threading, single threaded speed is reported across the board for all core counts.

Performance on Homogenous Datasets:

Uniform Float dataset:

The first dataset contains float values arranged in 1 Million rows and 20 columns. Pandas takes 232 milliseconds to load this file. Single threaded data.table is 1.6 times faster than CSV.jl. With Multithreading, CSV.jl is at its best, more than double the speed of data.table. CSV.jl is 1.5 times faster than Pandas without multithreading, and about 11 times faster with.

Uniform String dataset(I):

This dataset contains string values in all columns and has 1 Million rows and 20 columns. Pandas takes 546 milliseconds to load the file. With R, adding threads doesn’t seem to lead to any performance gain. Single threaded CSV.jl is 2.5 times faster than data.table. At 10 threads, it is about 14 times faster than data.table.

Uniform String dataset(II):

The dimensions of this dataset are the same as that of the one above. However, every column has missing values as well. Pandas takes 300 milliseconds. Without threading, CSV.jl is 1.2 times faster than R, and with, it is about 5 times faster.

Apple stock prices:

This dataset contains 50 million rows and 5 columns, and is 2.5GB. The rows are open, high, low, and close prices for AAPL stock. The four columns with prices are float values, and there is a date column.

The single threaded CSV.jl is about 1.5 times faster than R’s fread from data.table. With multithreading CSV.jl is about 22 times faster! Pandas’ read_csv takes 34s to read, this is slower than both R and Julia.

Performance on Heterogeneous Datasets

Mixed dataset:

This dataset has 10k rows and 200 columns. The columns contain, String, Float, DateTime, and missing values. Pandas takes about 400 milliseconds to load this dataset. Without threading, CSV.jl is 2 times faster than R, and is about 10 times faster with 10 threads.

Mortgage risk dataset

Now, let’s look at a wider dataset. This mortgage risk dataset from Kaggle is a mixed type dataset, with 356k rows and 2190 columns. The columns are heterogeneous and have values of types String, Int, Float, Missing. Pandas takes 119s to read in this dataset. Single threaded fread is about twice faster than CSV.jl. However, with more threads Julia is either as fast or slightly faster than R.

Wide dataset:

This is a considerably wider dataset with 1000 rows and 20,000 columns. The dataset contains string and Int values. Pandas takes 7.3 seconds to read the dataset. In this case, single threaded data.table is about 5 times faster than CSV.jl. With more threads, CSV.jl is competitive with data.table. Increasing the number of threads doesn’t seem to result in any performance gain in case of data.table.

Fannie Mae Acquisition dataset:

This dataset can be downloaded from Fannie Mae site here. The dataset has 4 Million rows and 25 columns and values of types Int, String, Float, Missing.

Single threaded data.table is 1.25 times faster than CSV.jl. But, the performance of CSV.jl keeps increasing with more threads. CSV.jl gets about 4 times faster with multi-threading.

Summary Charts:

Across all eight datasets, Julia’s CSV.jl is always faster than Pandas, and with multi-threading it is competitive with R’s data.table.

System Info:

The specs of the system on which the benchmarking was performed are as below





Authors

JuliaHub, formerly Julia Computing, was founded in 2015 by the four co-creators of Julia (Dr. Viral Shah, Prof. Alan Edelman, Dr. Jeff Bezanson and Stefan Karpinski) together with Deepak Vinchhi and Keno Fischer. Julia is the fastest and easiest high productivity language for scientific computing. Julia is used by over 10,000 companies and over 1,500 universities. Julia’s creators won the prestigious James H. Wilkinson Prize for Numerical Software and the Sidney Fernbach Award.

Authors

JuliaHub, formerly Julia Computing, was founded in 2015 by the four co-creators of Julia (Dr. Viral Shah, Prof. Alan Edelman, Dr. Jeff Bezanson and Stefan Karpinski) together with Deepak Vinchhi and Keno Fischer. Julia is the fastest and easiest high productivity language for scientific computing. Julia is used by over 10,000 companies and over 1,500 universities. Julia’s creators won the prestigious James H. Wilkinson Prize for Numerical Software and the Sidney Fernbach Award.

Authors

JuliaHub, formerly Julia Computing, was founded in 2015 by the four co-creators of Julia (Dr. Viral Shah, Prof. Alan Edelman, Dr. Jeff Bezanson and Stefan Karpinski) together with Deepak Vinchhi and Keno Fischer. Julia is the fastest and easiest high productivity language for scientific computing. Julia is used by over 10,000 companies and over 1,500 universities. Julia’s creators won the prestigious James H. Wilkinson Prize for Numerical Software and the Sidney Fernbach Award.

Learn about Dyad

Get Dyad Studio – Download and install the IDE to start building hardware like software.

Read the Dyad Documentation – Dive into the language, tools, and workflow.

Join the Dyad Community – Connect with fellow engineers, ask questions, and share ideas.

Learn about Dyad

Get Dyad Studio – Download and install the IDE to start building hardware like software.

Read the Dyad Documentation – Dive into the language, tools, and workflow.

Join the Dyad Community – Connect with fellow engineers, ask questions, and share ideas.

Contact Us

Want to get enterprise support, schedule a demo, or learn about how we can help build a custom solution? We are here to help.

Contact Us

Want to get enterprise support, schedule a demo, or learn about how we can help build a custom solution? We are here to help.