This blog post shows how to convert a CSV file to Parquet with Pandas, Spark, PyArrow and Dask. It discusses the pros and cons of each approach and explains how both approaches can happily coexist in the same ecosystem.

Parquet is a columnar file format whereas CSV is row based. Columnar file formats are more efficient for most analytical queries. You can speed up a lot of your Pandas DataFrame queries by converting your CSV files and working off of Parquet files.

All the code used in this blog is in this GitHub repo.

## Pandas

Suppose you have the following data/us_presidents.csv file:

```
full_name,birth_year
```

You can easily read this file into a Pandas DataFrame and write it out as a Parquet file, as described in this Stackoverflow answer.

```python
import pandas as pd

df = pd.read_csv('data/us_presidents.csv')
df.to_parquet('tmp/us_presidents.parquet')
```

This code writes out the data to a tmp/us_presidents.parquet file. Let's read the Parquet data into a Pandas DataFrame and view the results.

```python
df = pd.read_parquet('tmp/us_presidents.parquet')
```

Pandas provides a beautiful Parquet interface. Pandas leverages the PyArrow library to write Parquet files, but you can also write Parquet files directly from PyArrow.

## PyArrow

PyArrow lets you read a CSV file into a table and write out a Parquet file, as described in this blog post. The code is simple to understand:

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

table = pv.read_csv('./data/people/people1.csv')
pq.write_table(table, './tmp/pyarrow_out/people1.parquet')
```

PyArrow is worth learning because it provides access to the file schema and other metadata stored in the Parquet footer. Studying PyArrow will teach you more about Parquet.

## Dask

Dask is a parallel computing framework that makes it easy to convert a lot of CSV files to Parquet files with a single operation, as described in this post.

```python
import dask.dataframe as dd

# Read every CSV in the folder into one Dask DataFrame
df = dd.read_csv('./data/people/*.csv')
df.to_parquet('./tmp/people_parquet2', write_index=False)
```

Dask is similar to Spark and easier to use for folks with a Python background.

## Spark

Let's read the CSV data to a PySpark DataFrame and write it out in the Parquet format. There's a code snippet in the original post, but you'll need to read the blog post to fully understand it.

Spark is still worth investigating, especially because it's so powerful for big data sets.