HOW GREEN IS PRAGUE?

Lucka Burdová & Dominika Pípová
13 min read · Oct 25, 2020

Authors: Lucie Burdova and Dominika Pipova

Mentors: Zuzana Marvanova and Tereza Kmoskova (Billigence)

We met several years ago when we shared a cubicle at a corporation and found out that we also share the same sense of humour as well as an interest in data analysis. When we were both accepted to the Digital Academy, it was more than clear that we would be on the same team for the DA project. The cooperation only strengthened our friendship (read: Dominika likes explaining SQL and Python to her rubber duck, aka Lucka).

1) Objective of the project

While looking for the right project, our aim was to find a topic that we both like, ideally related to Prague, where we live, and, last but not least, one for which relevant data would be available.

Therefore, we wanted to analyse which Prague district is the greenest. The criteria were the following:

1) Distribution and availability of recycling bins

2) Air quality

3) Locations of parks

4) Bioshops and zero waste shops

5) Thermal islands (i.e. difference between air and road temperature)

The goal of our project is to map the recycling bins, parks, bioshops, and zero waste shops so that end-users can easily locate them in their area and orient themselves quickly in case they move to a new part of the city. In addition, based on the ecological criteria above, we wanted to assess which city area is the best to live in.

2) Lucy, but where will we get the data?

As per the original plan, we wanted to include more criteria, such as the number of trees and their impact on street temperature. However, due to data unavailability and other constraints, we ended up with the following data sources:

● OpenData Prague source used for:

o Parks: artificial and natural parks

o Recycling bins: their location and details

o Recycling bins with sensors to measure capacity

o Bioshops

● API Golemio for air quality

● API Golemio for meteosensors

● Web scraping with the Apify platform: our approach was “if we cannot find open data, we will scrape it!”. However, we then found out that even web scraping is not as easy as we thought (e.g. we often got irrelevant data). Thanks to Apify, we got additional data to extend the list of bioshops, zero waste shops, parks, and gardens in Prague.

3) Languages, modules, functions, and other tools used

● Data extraction via open data sources, Apify, and APIs (Python in Visual Studio)

● Python: dictionaries and wonderful nested dictionaries, while and for loops, and if conditions. We used modules such as json, csv, datetime, and requests.

● We imported the data into Keboola (via CSV import or a Python transformation built in Keboola)

● To clean the data, we worked with SQL in Snowflake and learned how to cast data types and use joins, group by, unions, window functions, and even subselects.

● Tableau was used for the visualization part. We have to mention how grateful we are for all our mentors’ help, because although visualization might seem like just the cherry on top, we spent dozens of hours making it just right.

4) Processing datasets

Recycling bins

All datasets are available on the OpenData Praha portal managed by the municipal government. As the format was JSON, we transformed the first dataset into CSV with Python and got the locations and accessibility of the recycling bins. The second dataset contains information about the frequency of waste collection, which we also processed in Python. The frequency itself was a code we had to crack, because it said e.g. 14, 17, 71, etc. Thanks to the documentation, we learned that the first digit is the week interval and the second digit the frequency within that interval. Therefore, we can read 41 as once in 4 weeks and 17 as 7 times per week. Using SQL, we translated the digits into text so the end-user could read them easily. Afterwards, we left-joined the two datasets into our first final dataset (TRASHBIN_POSITION_ACCESS).
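The decoding step described above can be sketched in Python (we actually did the translation in SQL; the function name and wording of the labels are ours):

```python
def decode_frequency(code: str) -> str:
    """Decode a two-digit collection-frequency code.

    The first digit is the week interval, the second digit the number
    of collections within that interval, e.g. '41' -> once in 4 weeks,
    '17' -> 7 times per week.
    """
    weeks, times = int(code[0]), int(code[1])
    if weeks == 1:
        return f"{times}x per week"
    return f"{times}x per {weeks} weeks"

print(decode_frequency("17"))  # 7x per week
print(decode_frequency("41"))  # 1x per 4 weeks
```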

To find out how full the recycling bins are, we were able to get datasets from the same source. The municipality of Prague has been collecting this data since December 2018 as a pilot project covering 424 recycling bins located in Prague 1, 3, 5, 6, 7, 8, 9, 10, 16, 18, 20, and 22. The fullness is measured 3–5 times per day for each bin. The first dataset contained information about location and bin type; the second provided the fullness of the bins. Here we ran into our first trouble with nested dictionaries, which we resolved using a for loop in Python. Whenever data was missing for a measurement, we inserted None to avoid bugs. We then cleaned the data with SQL in Keboola and joined the tables together. The output table is TRASHBIN_SENSOR_MEASUREMENT.
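The nested-dictionary flattening with None fallbacks can be sketched like this (the field names are illustrative, not the actual OpenData/Golemio schema):

```python
# Illustrative sample records; the real dataset has a similar nesting.
records = [
    {"id": "bin-1",
     "sensor": {"percent_calculated": 72,
                "measured_at": "2019-11-05T10:00:00Z"}},
    {"id": "bin-2", "sensor": {}},  # measurement missing for this bin
]

rows = []
for rec in records:                     # plain for loop over the JSON list
    sensor = rec.get("sensor", {})
    rows.append({
        "id": rec.get("id"),
        # .get() returns None when a key is absent, so missing
        # measurements become None instead of raising KeyError
        "fullness": sensor.get("percent_calculated"),
        "measured_at": sensor.get("measured_at"),
    })

print(rows[1])  # {'id': 'bin-2', 'fullness': None, 'measured_at': None}
```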

We decided not to join these two output tables, as many of the recycling bins were not included in both, and it was easier to work with them separately for the visualizations. The whole code can be found here.

Data Model Recycling Bins

Artificial and natural parks

The location of parks and gardens was our next ecological criterion. The list of parks in Prague, provided in JSON format by OpenData Praha, had 25 rows. It was also our next chance to practice nested dictionaries, and we were ready to accept the challenge (armed with the experience gained on the recycling bins). The complete Python code for parks is here and for gardens here.

Křovák’s conformal conic projection

The first problem we faced was that the location of the parks was provided in so-called Křovák conformal conic projection coordinates rather than GPS coordinates. These coordinates describe the whole area of a park rather than separate points. Unfortunately, Tableau is not compatible with this data, and we did not have enough knowledge to overcome the barrier ourselves. Luckily, our mentor Zuzka helped us convert the coordinates into standard GPS coordinates, and we could keep moving with our data analysis.

We also realised that we personally know many other parks that were not in our short dataset. Thanks to the lectures we had already attended, we decided to use Apify to scrape data from Google Places (another Zuzka, from Apify, helped us with this). Suddenly, our dataset was three times bigger! The keywords used for scraping were ‘parks in Prague’ and ‘gardens in Prague’. After that, we had to clean the data with SQL to keep only relevant rows. We also wanted the Prague district in our dataset, but we only had full addresses, so we had to come up with code to keep just the district. Soon after, we realised that a park usually spans more than one district, so we decided not to include parks in the final analysis. At least we had the chance to practice the array_size, array_to_string, and split functions. The complete code is here.
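In Python, keeping only the district from a full address could look like this (we did it in SQL with split functions; the sample addresses are made up):

```python
import re

def extract_district(address):
    """Pull 'Praha <number>' out of a full address, if present."""
    match = re.search(r"Praha\s*(\d+)", address)
    return f"Praha {match.group(1)}" if match else None

print(extract_district("Stromovka, 170 00 Praha 7, Czechia"))  # Praha 7
print(extract_district("Průhonice, Czechia"))                  # None
```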

Our data consists of two open datasets and two scraped datasets, which we joined accordingly. The final output tables Parks_Clean and Gardens_Clean were combined with a union in Tableau (it could just as well be done in Keboola).

Data Model Parks

Bioshops

The list of bioshops provided by OpenData Praha had 52 rows. By scraping Google Places via Apify, we got an additional 82 rows. The search keywords were ‘bioshops in Prague’ and ‘bezobalu in Prague’. Then we cleaned the data with SQL in Snowflake (Keboola), where we had a chance to practice e.g. extracting the district from the full address, deleting irrelevant rows outside of Prague, and unifying the district format as ‘Prague’ + number. In the end, we joined the two datasets with a full outer join. The complete code is here.
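The full-outer-join logic (ours was written in SQL) can be sketched in pure Python: keep every shop that appears in either source and fill the missing side with None. The shop names and fields below are made up for illustration:

```python
# Two sources keyed by shop name; columns differ per source
opendata = {"Bio Oko": {"district": "Praha 7"},
            "Zelený krámek": {"district": "Praha 2"}}
scraped = {"Bio Oko": {"rating": 4.6},
           "Bezobalu Dejvice": {"rating": 4.8}}

# Full outer join: union of keys, unmatched side becomes None
merged = {}
for name in opendata.keys() | scraped.keys():
    left = opendata.get(name, {})
    right = scraped.get(name, {})
    merged[name] = {"district": left.get("district"),
                    "rating": right.get("rating")}

print(len(merged))  # 3 rows: one matched, one left-only, one right-only
```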

Data Model Bioshops

Air quality

Air quality is measured on an hourly basis by the Czech Hydrometeorological Institute (CHMI), and each measurement is given a mark similar to the Czech school grading system (1 = best, 5 = worst). In addition, when the situation is beyond bad, the mark is 6. However, that happened in only 0.02 % of all measurements since June 2018.

How did we get the data? First, we had to create a token so that we could communicate with the API. The token was inserted into the request headers. After MANY attempts, we got a 200 response instead of 400 and finally received the data! But… we still had to play with a while loop to also get the historical data. The complete Python code is here.
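The while-loop paging pattern can be sketched like this; fetch_page below is a stub standing in for the authenticated Golemio request (the page size and response shape are assumptions, not the real API schema):

```python
# Stub standing in for the real API call with the token in the headers,
# e.g. requests.get(url, headers={"x-access-token": TOKEN}, params=...).
# It serves a fake history of 25 hourly measurements, 10 per page.
FAKE_HISTORY = [{"hour": h, "index": 1 + h % 5} for h in range(25)]

def fetch_page(offset, limit=10):
    return FAKE_HISTORY[offset:offset + limit]

measurements = []
offset = 0
while True:                  # keep paging until the API returns an empty list
    page = fetch_page(offset)
    if not page:
        break
    measurements.extend(page)
    offset += len(page)

print(len(measurements))  # 25
```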

Legend for the hourly index and the index of all measured values

We cleaned the data in SQL (the full code is stored here) and fought with duplicates, which made up almost 30 % of the rows. We also lost some data due to missing station information, as well as irrelevant records.

There are several stations in Prague, which we divided into centre and suburbs as defined on the official CHMI site. However, two stations appeared to be incorrectly assigned given their geographical location, so we deliberately reassigned them in SQL. In addition, we used meteorological data to analyse a potential relationship between air pollution and weather; however, no real correlation was found.

Calculating the correct time zone, including daylight saving and winter time.
Calculating the ‘hour’ variable in the function above

We got stuck on the timestamp, which we had originally converted already in Python. That was not the best idea, since the code did not account for the time zone. We therefore changed the code to keep the full timestamp, which we then processed in Tableau.
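Handling the time zone correctly, including the summer/winter switch, is straightforward with the standard-library zoneinfo module; a minimal sketch (this mirrors what we ultimately let Tableau do, rather than our actual code):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PRAGUE = ZoneInfo("Europe/Prague")

def to_local(utc_string):
    """Parse an ISO timestamp given in UTC and convert it to Prague time."""
    utc = datetime.fromisoformat(utc_string).replace(tzinfo=timezone.utc)
    return utc.astimezone(PRAGUE)

# In summer Prague is UTC+2 (CEST), in winter UTC+1 (CET)
print(to_local("2019-07-18T10:00:00").hour)  # 12
print(to_local("2019-01-13T10:00:00").hour)  # 11
```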

Datamodel Air_Quality

*The formula was changed on 18 November 2019; details are on the CHMI webpage. Our analysis works with the former formula.

Thermal Islands

The Technical Road Administration of Prague provides data on road and air temperature. Using this data, we wanted to analyse which area accumulates the most heat and thus creates thermal islands.

Since historical data was not available, we created a Python script to scrape the latest data from the Golemio API, and thanks to Keboola Orchestration, the data is updated every hour. The orchestration started on 5 November, so our dataset is too small for a detailed analysis, but it gives us at least a small insight.

Since we used the same API as for air quality, we applied the same approach for the thermal islands. We cleaned the data in SQL and then worked with the table “TSK_meteosensors_clean”. The same timestamp issue occurred here as well.

Datamodel TSK_meteosensors

5) Outputs, visualizations, and interesting facts

All our visualizations are publicly accessible via Tableau Public.

And what did we find out?

Maps of Prague, or the various icon types

What is the distribution of recycling bins vs. recycling bins with sensors?

We added the option to filter the recycling bins by district, street, trash type, or accessibility, while also showing what does and does not belong in each recycling bin.

Total number of recycling bins for sorted waste
Overview of bins with sensors only

Which trash type has the highest frequency of waste collection and in which area?

Taking into account only publicly accessible recycling bins, the highest collection frequency is for paper and plastic. Interestingly, in Prague 4 and Prague 12 these two types are collected less than twice a week. The least frequently collected bins, those for coloured glass, are also located in Prague 4 and Prague 12.

Collection frequency by waste type

How about the bins with sensors?

By comparing the average fullness with the collection frequency, we get a clear picture of how the two relate. For example, coloured glass bins in Prague 8 have the highest fullness while being collected with the lowest frequency.

Collection frequency and average fullness

TOP 5 fullest and emptiest recycling bins

Below are the TOP 5 fullest and emptiest recycling bins in the whole of Prague. The most interesting fact is that the lowest average fullness for paper and plastic is actually around 30–40 % in some cases, which makes us think that even though the collection frequency is high, it is still insufficient compared to the amount of waste people produce.

TOP 5

When is the worst air quality?

The worst long-term situation has been measured at the Legerova, Průmyslová, and Karlín stations, which are situated at intersections with heavy traffic.

Is it worth living in the suburbs?

Overall, there is not much difference, especially in summertime, when the air quality is pretty much the same.

Is it better to live near an airport or a waste incinerator?

At first sight, the airport seems the better choice if you look at the overall evaluation. However, a closer look at the PM10 measurements (i.e. particulate matter smaller than 10 μm; for comparison, a human hair is 50–70 μm) shows that you would breathe a lot less particulate matter near the waste incinerator.

When is it better to stay home?

The best air quality was on 13 January 2019 and the worst on 18 October 2018. In December 2018, there were 3 days listed among the TOP 5 worst days; see the picture below. In general, the worst situation is around 7 PM.

The 5 best and worst days (measured from June 2018 to November 2019)

Where is the warmest road in Prague?

This analysis uses data only from 5–22 November 2019, since there was no historical data before we started scraping it.

At every station besides Lahovice, the average road temperature is higher than the average air temperature. In Uhříněves, the difference is even more than 4 degrees!

When is the biggest difference between road and air temperature?

There is indeed a difference (on average) throughout the whole day; however, the gap widens during daylight hours and peaks at noon. This makes sense, as the black surface of the road absorbs heat.

For a detailed analysis, a bigger dataset would be required, especially from summertime, when the days are long and the temperature is high. In addition, it would be beneficial to have data about street trees to see how big an impact trees and shade can have, as well as weather data such as the length of daylight and whether it was sunny or cloudy.

Which Prague district is the best?

At the end of our data analysis, we used a rank function for each ecological criterion:

- Bioshops — ranked by the highest count per district

- Recycling bins — ranked by the lowest ratio of inhabitants (data from the Czech Statistical Office) to recycling bins per district

- Air quality — ranked by the best air quality per district

- Thermal islands — ranked by the lowest difference between road and air temperature
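The per-criterion ranking can be sketched in Python; we actually used SQL’s rank function, and the district values below are made up for illustration:

```python
# Made-up inhabitants-per-bin ratios per district; lower is better
ratio = {"Praha 10": 210, "Praha 3": 340, "Praha 7": 260}

# Rank 1 = best (lowest ratio), mirroring RANK() OVER (ORDER BY ratio)
ranked = {district: position + 1
          for position, (district, _) in enumerate(
              sorted(ratio.items(), key=lambda kv: kv[1]))}

print(ranked)  # {'Praha 10': 1, 'Praha 7': 2, 'Praha 3': 3}
```

The same pattern works for every criterion; only the value being sorted (shop count, air-quality mark, temperature difference) and the sort direction change.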

Data Model Prague district comparison

When preparing the overall comparison, we had to exclude the districts where we did not have data for every category of our ecological criteria, which left us with only 7 districts to compare out of 22 in total. Of these, Prague 10 scored the best. The overall comparison would reflect reality better if we had data for every district and criterion. Our next step would be to dive deep into data from other possible sources and combine them with ours. For the moment, this comparison gives at least a rough picture of the ecological situation in Prague.

Closure

For both of us, the biggest benefit of working on this project was the chance to try out and practice scripting languages and ETL tools. We worked with everything from basic scripts in Python and SQL, through more advanced formulas, joins, and subselects, up to transformations, orchestrations, and writers in Keboola. We also managed to create highly interactive dashboards in Tableau that helped us answer many of our questions. Above all, we had the opportunity to learn a lot during the Digital Academy lectures!

Acknowledgement

Big thanks go to our mentors Zuzka Marvanova and Terka Kmoskova from Billigence for their great help with Tableau, to Zuzka Pelechova from Apify for her support with the Apify app, and to our lecturers, coaches, and everyone else who supported us during our DA journey. We would also like to thank each and every Czechita for making this happen. Last but not least, we thank each other for being supportive during the hard times of bug fixing :)
