In Fall 2020, I worked with one other student to build a data processing pipeline in Python and SQL. The pipeline reads data from multiple CSV files, loads the data into a SQL database, and analyzes it. "K Pop", a form of pop music originating in South Korea, has seen a massive increase in popularity in recent years, and the goal of the project was to measure how this increase relates to factors outside of pop culture and music. Specifically, the project examined three variables: (1) the popularity of K Pop, (2) South Korea’s economic growth, and (3) foreign language aptitude among high school students in the United States.
Popularity of K Pop was measured as the frequency of "korean pop", "k pop", "k-pop", and "kpop" searches as indicated by Google Trends.
South Korea’s economic growth was measured as gross domestic product (GDP).
Aptitude in foreign language amongst high school students in the United States was measured as average foreign language GPAs.
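To illustrate how three measures like these could sit side by side in a database, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the choice to sum search interest across keywords within a year are all assumptions made for illustration; they are not the project's actual schema or aggregation method.

```python
import sqlite3


def build_database(trends_rows, gdp_rows, gpa_rows):
    """Load the three measures into an in-memory SQLite database.

    The schema below is hypothetical, not the project's actual design.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE search_interest (
            genre TEXT, keyword TEXT, year INTEGER, interest REAL
        );
        CREATE TABLE gdp (country TEXT, year INTEGER, gdp_usd REAL);
        CREATE TABLE language_gpa (year INTEGER, avg_gpa REAL);
        """
    )
    conn.executemany("INSERT INTO search_interest VALUES (?, ?, ?, ?)", trends_rows)
    conn.executemany("INSERT INTO gdp VALUES (?, ?, ?)", gdp_rows)
    conn.executemany("INSERT INTO language_gpa VALUES (?, ?)", gpa_rows)
    conn.commit()
    return conn


def yearly_measures(conn, genre, country):
    """Return one (year, interest, gdp_usd, avg_gpa) row per year.

    Interest is summed over all keywords of the genre (an assumption).
    """
    query = """
        SELECT s.year, SUM(s.interest) AS interest, g.gdp_usd, l.avg_gpa
        FROM search_interest AS s
        JOIN gdp AS g ON g.year = s.year AND g.country = ?
        JOIN language_gpa AS l ON l.year = s.year
        WHERE s.genre = ?
        GROUP BY s.year, g.gdp_usd, l.avg_gpa
        ORDER BY s.year
    """
    return conn.execute(query, (country, genre)).fetchall()
```

Joining on year gives one row per year with all three variables, which is a convenient shape for computing associations between them.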
The pipeline was designed so that little to no manual manipulation of the data files was required. Keywords (e.g., "korean pop", "k pop") were automatically extracted from the Google Trends data files and assigned a genre, and country names were automatically extracted from the GDP data files. As a result, the pipeline could readily examine associations between the popularity of any music genre (or any other Google Trends search term) and any country's GDP.
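The keyword extraction step might look roughly like the sketch below. It assumes the Google Trends export has a header row whose value columns read like `kpop: (Worldwide)`, preceded by a category line and a blank line. The `GENRE_BY_KEYWORD` lookup, the function name `parse_trends_csv`, and the handling of `<1` cells are hypothetical choices for illustration, not the project's actual code.

```python
import csv
import io
import re

# Hypothetical lookup from search keyword to genre.
GENRE_BY_KEYWORD = {
    "korean pop": "K Pop",
    "k pop": "K Pop",
    "k-pop": "K Pop",
    "kpop": "K Pop",
}

# Matches header cells such as "kpop: (Worldwide)" (assumed export format).
HEADER_PATTERN = re.compile(r"^(?P<keyword>.+?):\s*\((?P<region>.+)\)$")


def parse_trends_csv(text, genre_by_keyword=GENRE_BY_KEYWORD):
    """Return (genre, keyword, period, interest) tuples from a Trends export."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]

    # The header is the first row whose second cell looks like "keyword: (region)".
    header_index = next(
        (i for i, row in enumerate(rows)
         if len(row) >= 2 and HEADER_PATTERN.match(row[1].strip())),
        None,
    )
    if header_index is None:
        raise ValueError("no keyword header row found")

    header = rows[header_index]
    records = []
    for col in range(1, len(header)):
        match = HEADER_PATTERN.match(header[col].strip())
        if match is None:
            continue
        keyword = match.group("keyword").strip().lower()
        genre = genre_by_keyword.get(keyword, "unknown")
        for row in rows[header_index + 1:]:
            if col >= len(row):
                continue
            value = row[col].strip()
            # Treating "<1" as 0.5 is an assumption made for this sketch.
            interest = 0.5 if value == "<1" else float(value)
            records.append((genre, keyword, row[0], interest))
    return records
```

Because the keyword comes from the file's own header, a file for a different search term needs no manual editing; only the keyword-to-genre lookup would have to be extended.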