In this recipe, we look at the different file formats supported by Sqoop. Sqoop can import data in various file formats, such as Parquet files and sequence files, selected with the --as-parquetfile and --as-sequencefile options. Irrespective of the data format in the RDBMS tables, once you specify the required file format in the sqoop import command, the Hadoop MapReduce job running in the background handles the conversion automatically.
Prerequisites: Before proceeding with the recipe, make sure the required installations are done on your local EC2 instance; this recipe uses Hadoop, Sqoop, and MySQL, plus PySpark for the Spark examples later on. With the environment set up, the steps are as follows.
Importing data in different file formats
Step 1: Log in to MySQL with your credentials and check the tables in the "test" database. We use a "flights_info" table from the "test" database to demonstrate this recipe.
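A sketch of this step and the subsequent imports, assuming MySQL runs locally; the user name, JDBC URL, and HDFS target directories are illustrative placeholders, not the recipe's original values:

```
# Step 1: log in to MySQL (user name and password are placeholders).
mysql -u root -p

# Inside the MySQL shell: select the "test" database and list its tables.
USE test;
SHOW TABLES;
EXIT;

# Import the flights_info table into HDFS as Parquet files.
sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table flights_info \
  --target-dir /user/hadoop/flights_info_parquet \
  --as-parquetfile \
  -m 1

# The same import written as sequence files instead.
sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table flights_info \
  --target-dir /user/hadoop/flights_info_seq \
  --as-sequencefile \
  -m 1
```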
The remainder of this recipe works with Parquet files from PySpark.
Pyspark Read Parquet file into DataFrame
PySpark provides the parquet() method in the DataFrameReader class to read a Parquet file into a DataFrame. Below is an example of reading a Parquet file into a DataFrame.
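A minimal sketch, assuming a small sample DataFrame and the illustrative path /tmp/output/people.parquet; the column names match the examples later in this recipe:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

# Sample data (illustrative).
data = [("James", "M", 3000), ("Anna", "F", 4100), ("Robert", "M", 6200)]
df = spark.createDataFrame(data, ["name", "gender", "salary"])

# Write the DataFrame out as Parquet so there is a file to read back.
df.write.mode("overwrite").parquet("/tmp/output/people.parquet")

# Read the Parquet file back into a DataFrame.
parDF = spark.read.parquet("/tmp/output/people.parquet")
parDF.show()
```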
Append or Overwrite an existing Parquet file
Using append save mode, you can append a DataFrame to an existing Parquet file. To overwrite the file instead, use overwrite save mode.
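Continuing the sketch above with the same df and path:

```
# Add the DataFrame's rows to the existing Parquet file.
df.write.mode("append").parquet("/tmp/output/people.parquet")

# Replace the file's contents entirely.
df.write.mode("overwrite").parquet("/tmp/output/people.parquet")
```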
Executing SQL queries on DataFrame
PySpark SQL lets you create temporary views on Parquet data for executing SQL queries. These views are available until your program exits.
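A sketch using the DataFrame read earlier; the view name and the salary filter are arbitrary choices for illustration:

```
# Register the DataFrame as a temporary view and query it with SQL.
parDF.createOrReplaceTempView("ParquetTable")
spark.sql("SELECT * FROM ParquetTable WHERE salary >= 4000").show()
```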
Creating a table on Parquet file
Now let's walk through executing SQL queries directly on a Parquet file. To do so, create a temporary view or table on the Parquet file itself instead of creating it from a DataFrame.
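A sketch of this, assuming the same illustrative path as above:

```
# Create a temporary view directly on the Parquet file, without
# loading it into a DataFrame first.
spark.sql(
    "CREATE TEMPORARY VIEW PERSON "
    "USING parquet OPTIONS (path '/tmp/output/people.parquet')"
)
spark.sql("SELECT * FROM PERSON").show()
```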
Here, we created a temporary view PERSON directly on the "people.parquet" file; querying the view returns the rows stored in the file.
Create Parquet partition file
When we execute a particular query on the PERSON table, it scans through all the rows and returns the results, similar to traditional database query execution. In PySpark, we can optimize query execution by partitioning the data with the DataFrameWriter's partitionBy() method. The following is an example of partitionBy().
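A sketch, partitioning the sample data from earlier:

```
# Write the data partitioned on disk by gender, then salary.
df.write.partitionBy("gender", "salary") \
    .mode("overwrite") \
    .parquet("/tmp/output/people2.parquet")
```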
When you check the people2.parquet output, it contains two levels of partition directories, "gender" followed by "salary".
Retrieving from a partitioned Parquet file
The example below reads a partitioned Parquet file into a DataFrame, selecting only the gender=M partition.
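A sketch, pointing the read at the partition directory (path illustrative):

```
# Read only the gender=M partition directory.
parDF2 = spark.read.parquet("/tmp/output/people2.parquet/gender=M")
parDF2.show()
```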
The resulting DataFrame contains only the rows stored under the gender=M partition.
Creating a table on Partitioned Parquet file
Here, we create a table on the partitioned Parquet file and execute a query against it. Because the query scans only the relevant partition rather than the whole table, it executes faster than the same query on an unpartitioned table, improving performance; this step is included in the complete listing below.
Conclusion: We have learned how to write a Parquet file from a PySpark DataFrame, how to read a Parquet file back into a DataFrame, and how to create views/tables on Parquet files to execute SQL queries. We also explained how to partition Parquet files to improve query performance.
Complete Example of PySpark read and write Parquet file
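The following is a minimal end-to-end sketch combining the steps above; the sample data, paths, and view names are illustrative assumptions rather than the recipe's original listing:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-complete-example").getOrCreate()

# Sample data (illustrative).
data = [("James", "M", 3000), ("Anna", "F", 4100), ("Robert", "M", 6200)]
df = spark.createDataFrame(data, ["name", "gender", "salary"])

# Write and read back a plain Parquet file.
df.write.mode("overwrite").parquet("/tmp/output/people.parquet")
parDF = spark.read.parquet("/tmp/output/people.parquet")
parDF.show()

# Append to the existing file.
df.write.mode("append").parquet("/tmp/output/people.parquet")

# SQL on a temporary view created from the DataFrame.
parDF.createOrReplaceTempView("ParquetTable")
spark.sql("SELECT * FROM ParquetTable WHERE salary >= 4000").show()

# Temporary view created directly on the Parquet file.
spark.sql(
    "CREATE TEMPORARY VIEW PERSON "
    "USING parquet OPTIONS (path '/tmp/output/people.parquet')"
)
spark.sql("SELECT * FROM PERSON").show()

# Write a partitioned copy: gender first, then salary.
df.write.partitionBy("gender", "salary") \
    .mode("overwrite") \
    .parquet("/tmp/output/people2.parquet")

# Read a single partition directory.
spark.read.parquet("/tmp/output/people2.parquet/gender=M").show()

# Table on one partition of the partitioned file; queries against it
# scan only that partition, so they run faster than on the full table.
spark.sql(
    "CREATE TEMPORARY VIEW PERSON2 "
    "USING parquet OPTIONS (path '/tmp/output/people2.parquet/gender=M')"
)
spark.sql("SELECT * FROM PERSON2").show()

spark.stop()
```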