London's finance industry is based in the City of London and Canary Wharf, the two major business districts of the city, and London itself is one of the world's preeminent financial centres and among the most important locations for international finance. In this work we discuss which venues are the most promising to invest in across London: we present an analysis made in Python and show, step by step, how to determine the top 10 venues for the 3 best boroughs, based on demographic data and unemployment rate statistics.
The content of this work has five sections:
- Getting the data
- Data wrangling
- Methodology
- Data analysis
- Results and Discussion
London is one of the most attractive cities in the world to do business in. It is the capital of both England and the UK. In this notebook we show how to study data about the UK in order to understand how and where to invest money in a small business.
# library for BeautifulSoup, for web scraping
from bs4 import BeautifulSoup
# library to handle data in a vectorized manner
import numpy as np
# library for data analysis
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# library to handle JSON files
import json
print('numpy, pandas, ..., imported...')
!pip -q install geopy
print('geopy installed...')
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
print('Nominatim imported...')
# library to handle requests
import requests
print('requests imported...')
# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize
print('json_normalize imported...')
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
print('matplotlib imported...')
# import k-means from clustering stage
from sklearn.cluster import KMeans
print('Kmeans imported...')
# install the Geocoder
!pip -q install geocoder
import geocoder
# import time
import time
!pip -q install folium
print('folium installed...')
import folium # map rendering library
print('folium imported...')
from pandas import ExcelWriter
from pandas import ExcelFile
print('...Done')
import warnings
warnings.filterwarnings('ignore')
1.2 Selecting the data
The unemployment ratio is the proportion of the working-age population that is unemployed. The unemployment ratio by borough in London is shown in the following figure, which is produced by the following commands:
unemployment_ratio_df = pd.read_csv("unemployment_ratio.csv")
unemployment_ratio_df2=unemployment_ratio_df.dropna()
unemployment_ratio_df2.rename({'Unnamed: 1': '2001', 'Unnamed: 2': '2011'}, axis=1, inplace=True)
unemployment_ratio_df2.reset_index(drop=True, inplace=True)
unemployment_ratio_df2['2011'] = unemployment_ratio_df2['2011'].str[:-1].astype(float)
unemployment_ratio_df2['2001'] = unemployment_ratio_df2['2001'].str[:-1].astype(float)
unemployment_ratio_df3=unemployment_ratio_df2.sort_values(by='2011', ascending=True)
unemployment_ratio_df3.plot(x="Unemployment ratio by borough", y=["2001", "2011"], kind="bar")
This graph shows that the ratio has come down significantly in almost all London boroughs in a relatively short timescale.
Considering the lowest unemployment ratios by borough, we have the following table:
unemployment_ratio_df3.head()
The top 3 boroughs with the lowest unemployment ratio in 2011 are:
- Hammersmith and Fulham
- Wandsworth
- Kingston upon Thames
Now we compare this with the change in unemployment ratio from 2011-13 to 2014-16:
change_unemployment_ratio_df = pd.read_csv("change_unemployment_ratio.csv")
change_unemployment_ratio_df=change_unemployment_ratio_df.dropna()
change_unemployment_ratio_df.head()
change_unemployment_ratio_df.rename({'Change in unemployment ratio 2011-13 to 2014-16': 'Borough', 'Unnamed: 1': 'Percentual'}, axis=1, inplace=True)
change_unemployment_ratio_df['Percentual'] = change_unemployment_ratio_df['Percentual'].str[:-1].astype(float)
change_unemployment_ratio_df2=change_unemployment_ratio_df.sort_values(by='Percentual', ascending=True)
change_unemployment_ratio_df2.plot(kind='bar',x='Borough',y='Percentual')
change_unemployment_ratio_df2.head(10)
And the largest changes in unemployment ratio are for the boroughs:
- Newham
- Croydon
- Ealing
But this fact alone does not guarantee that a borough will be a good place to invest. So, to be more confident about which borough is the best to invest in, we consider the average percentage change, which can be a good indication of which borough to choose.
change_unemployment_ratio_df2["Percentual"].mean()
Therefore, a borough whose change is around the mean of -1.98% should be stable:
- Hammersmith and Fulham has -2.200%
- Haringey -2.134%
- Waltham Forest -2.036%
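These candidates can also be selected programmatically. A minimal sketch, assuming the dataframe and column names defined above (the expected output is inferred from the figures quoted, not verified here):
mean_change = change_unemployment_ratio_df2['Percentual'].mean()  # about -1.98
# boroughs at or below the mean, ordered from closest to the mean downwards
near_mean = change_unemployment_ratio_df2[
    change_unemployment_ratio_df2['Percentual'] <= mean_change
].sort_values('Percentual', ascending=False)
near_mean.head(3)  # expected: Waltham Forest, Haringey, Hammersmith and Fulham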
We have these three candidates, but since Hammersmith and Fulham also has the lowest unemployment ratio, it is a good candidate to invest in. The borough is located in the South West and North West of London.
Thus we are going to explore which small businesses show a trend across the South West of London.
London is one of the strongest regions and cities on the planet. It is highly attractive and offers some of the best prospects for business development and innovative thinking, which makes it central to any European or global investment strategy.
2. Data wrangling
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another format, with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
In this section we are going to extract more data about London from different sources and transform it into dataframes in order to perform the analytics.
2.1 Exploration of the city of London
In order to transform data from websites into data which we can process, we require a great tool called BeautifulSoup. We load the libraries and get the List of areas of London.
# library for BeautifulSoup
from bs4 import BeautifulSoup
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'
wikipedia_page = requests.get(wikipedia_link)
# Cleans html file
soup = BeautifulSoup(wikipedia_page.content, 'html.parser')
# This extracts the "tbody" within the table where class is "wikitable sortable"
table = soup.find('table', {'class':'wikitable sortable'}).tbody
# Extracts all "tr" (table rows) within the table above
rows = table.find_all('tr')
# Extracts the column headers, removes and replaces possible '\n' with space for the "th" tag
columns = [i.text.replace('\n', '')
           for i in rows[0].find_all('th')]
# Converts columns to pd dataframe
df = pd.DataFrame(columns = columns)
'''
Extracts every row with its corresponding columns, then appends the values to the
pd dataframe "df" created above. The first row (rows[0]) is skipped because it is the header.
'''
for i in range(1, len(rows)):
    tds = rows[i].find_all('td')
    # strip newlines and non-breaking spaces from every cell
    values = [td.text.replace('\n', '').replace('\xa0', '') for td in tds]
    # note: DataFrame.append is deprecated in recent pandas; pd.concat is the modern equivalent
    df = df.append(pd.Series(values, index=columns), ignore_index=True)
df
df.head(5)
df.columns = ['Location', 'Borough', 'Post-town', 'Postcode',
'Dial-code', 'OSgridref']
df.columns
df.head()
We have to perform some cleaning with the following command:
# Remove Borough reference numbers with []
df['Borough'] = df['Borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
df.head()
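To see why the chained rstrip calls work, here is a quick check on a hypothetical cell value (the bracketed number stands for a Wikipedia reference mark):
sample = 'Hammersmith and Fulham[7]'
sample.rstrip(']').rstrip('0123456789').rstrip('[')  # -> 'Hammersmith and Fulham'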
We are now going to make some assumptions to reduce the amount of data that we want to process.
Cells containing several postcodes are split so that each postcode gets its own row, with the values of the other columns repeated.
df0 = df.drop('Postcode', axis=1).join(df['Postcode'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('Postcode'))
df0.head()
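The chained split/stack above is dense, so here is a toy illustration (hypothetical values) of what it does:
toy = pd.DataFrame({'Location': ['A'], 'Postcode': ['SW6, SW10']})
toy.drop('Postcode', axis=1).join(
    toy['Postcode'].str.split(',', expand=True)
       .stack().reset_index(level=1, drop=True).rename('Postcode'))
# index 0 is duplicated: one row with 'SW6' and one with ' SW10'
# (note the leading space, which is removed later with str.strip())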
From the data, only the ‘Location’, ‘Borough’, ‘Postcode’, ‘Post-town’ will be used.
df1 = df0[['Location', 'Borough', 'Postcode', 'Post-town']].reset_index(drop=True)
df1.head()
Now, only the Boroughs with London Post-town will be used for our search of location. Therefore, all the non-post-town are dropped.
df2 = df1.copy() # work on a copy of df1
df21 = df2[df2['Post-town'].str.contains('LONDON')]
df21.shape
df3 = df21[['Location', 'Borough', 'Postcode']].reset_index(drop=True)
df3.head()
From the study of the boroughs, Hammersmith and Fulham lies in the North West and South West areas of London. For this project the South West will be considered for our analysis. The south-west areas have postcodes starting with SW.
df_london = df3.copy()  # use a copy so df3 is left unchanged
# Strips whitespace around each postcode
df_london.Postcode = df_london.Postcode.str.strip()
# New dataframe for South West London postcodes - df_sw
df_sw = df_london[df_london['Postcode'].str.startswith(('SW'))].reset_index(drop=True)
df_sw.head(10)
We are interested in the demography of London, specifically the proportion of White residents by borough.
demograph_link = 'https://en.wikipedia.org/wiki/Demography_of_London'
demograph_page = requests.get(demograph_link)
soup1 = BeautifulSoup(demograph_page.content, 'html.parser')
table1 = soup1.find('table', {'class':'wikitable sortable'}).tbody
rows1 = table1.find_all('tr')
columns1 = [i.text.replace('\n', '')
            for i in rows1[0].find_all('th')]
demo_london = pd.DataFrame(columns = columns1)
for j in range(1, len(rows1)):
    tds1 = rows1[j].find_all('td')
    # strip newlines and non-breaking spaces from every cell
    values1 = [td1.text.replace('\n', '').replace('\xa0', '') for td1 in tds1]
    demo_london = demo_london.append(pd.Series(values1, index=columns1), ignore_index=True)
demo_london
#converting string to float
demo_london['White'] = demo_london['White'].astype('float')
demo_london_sorted = demo_london.sort_values(by='White', ascending = False)
demo_london_sorted.head(10)
demo_london_sorted["White"].mean()
Considering the top 6 boroughs whose share of White residents is above the average of 61.58%, we have the following list:
- Hammersmith and Fulham 68.1%
- Camden 66.3%
- Merton 64.9%
- Barnet 64.1%
- Greenwich 62.5%
- Westminster 61.7%
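The same shortlist can be reproduced programmatically; a minimal sketch using the dataframe built above:
avg_white = demo_london_sorted['White'].mean()  # about 61.58
above_avg = demo_london_sorted[demo_london_sorted['White'] > avg_white]
above_avg.head(6)  # expected to match the six boroughs listed above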
df_sw_top = df_sw[df_sw['Borough'].isin(['Hammersmith and Fulham','Camden', 'Merton', 'Barnet','Greenwich', 'Westminster'])].reset_index(drop=True)
df_sw_top
df_sw_top.shape
# Geocoder starts here
# Defining a function to use --> get_latlng()
def get_latlng(arcgis_geocoder):
    # Initialize the location (lat. and long.) to "None"
    lat_lng_coords = None
    # Loop until the location has been geocoded
    # (note: this retries indefinitely if the geocoder keeps failing)
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, London, United Kingdom'.format(arcgis_geocoder))
        lat_lng_coords = g.latlng
    return lat_lng_coords
# Geocoder ends here
sample = get_latlng('SW6')
sample
ga = geocoder.geocodefarm(sample, method = 'reverse')
ga
start = time.time()
postal_codes = df_sw_top['Postcode']
coordinates = [get_latlng(postal_code) for postal_code in postal_codes.tolist()]
end = time.time()
print("Time of execution: ", end - start, "seconds")
Then we proceed to store the location data (latitude and longitude) as follows. The obtained coordinates are then joined to df_sw_top to create a new data frame.
df_sw_loc = df_sw_top.copy()  # use a copy so df_sw_top is left unchanged
# The obtained coordinates (latitude and longitude) are joined with the dataframe as shown
df_sw_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])
df_sw_loc['Latitude'] = df_sw_coordinates['Latitude']
df_sw_loc['Longitude'] = df_sw_coordinates['Longitude']
df_sw_loc.head(5)
df_sw_loc.shape
In order to use Foursquare we need credentials, which are saved in the file credential.json:
import json
filename = 'credential.json'
with open(filename) as f:
    data = json.load(f)
CLIENT_ID = data['CLIENT_ID'] # Foursquare (FS) ID
CLIENT_SECRET = data['CLIENT_SECRET'] # FS Secret
VERSION = data['VERSION'] # FS API version
#LIMIT = 30
#print('Your credentails:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)
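If the file does not exist yet, it can be created with a one-off helper like the following (placeholder values, not real credentials; the VERSION date is only an example of the Foursquare API's YYYYMMDD version format):
creds = {
    'CLIENT_ID': 'YOUR_FOURSQUARE_CLIENT_ID',
    'CLIENT_SECRET': 'YOUR_FOURSQUARE_CLIENT_SECRET',
    'VERSION': '20180605'  # placeholder API version date
}
with open('credential.json', 'w') as f:
    json.dump(creds, f)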
3. Methodology
With the Foursquare API, we are going to obtain a data set of venues around specific locations, which will be stored in a data frame; with that it is possible to do the analytics.
First, we will explore and collect data for a single neighbourhood; later we apply the same procedure to multiple neighbourhoods, storing the results in a data frame.
The following step is to cluster the venues by location and identify the most common venues in each cluster.
3.1 Data Exploration
Single Neighbourhood
An initial exploration of a single neighbourhood within the London area was done to get familiar with the Foursquare API.
# Resets the current index to a new
sw_df = df_sw_loc.reset_index().drop('index', axis = 1)
sw_df.loc[sw_df['Location'] == 'Fulham']
Let's use Fulham, at index location 2, as shown below:
Fulham_lat = sw_df.loc[2, 'Latitude']
Fulham_long = sw_df.loc[2, 'Longitude']
Fulham_loc = sw_df.loc[2, 'Location']
Fulham_postcode = sw_df.loc[2, 'Postcode']
print('The latitude and longitude values of {} with postcode {}, are {}, {}.'.format(Fulham_loc, Fulham_postcode, Fulham_lat, Fulham_long))
Let's explore the top 50 venues that are within a 1500-metre radius of Fulham. We then create the GET request URL and store it in url.
# Credentials are provided already for this part
LIMIT = 50 # limit of number of venues returned by Foursquare API
radius = 1500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
Fulham_lat,
Fulham_long,
radius,
LIMIT)
# displays URL
url
results = requests.get(url).json()
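The part of the response we care about has the following nested shape (abridged, inferred from the keys accessed in the code below):
# results['response']['groups'][0]['items'] is a list of dicts like:
# {
#     'venue': {
#         'name': ...,
#         'location': {'lat': ..., 'lng': ...},
#         'categories': [{'name': ...}, ...]
#     },
#     ...
# }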
From the results, the necessary information needs to be extracted from the items key. To do this, the get_category_type function below is used:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
The result is then cleaned up from json to a structured pandas dataframe as shown below:
import numpy as np
import pandas as pd
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues) # flatten JSON
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head(10)
nearby_venues_Fulham_unique = nearby_venues['categories'].value_counts().to_frame(name='Count')
The most common venues in Fulham are:
nearby_venues_Fulham_unique.head(5)
So in Fulham the top nearby venues are café, pizza place, pub, grocery store and fish & chips shop.
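For a quick visual check, the counts can also be plotted (an optional sketch using the dataframe above):
nearby_venues_Fulham_unique.head(5).plot(kind='bar', legend=False)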
Multiple Neighbourhoods
In order to repeat the previous procedure for multiple neighbourhoods, we define a function:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
and we use this function on our data frame sw_df as follows:
sw_venues = getNearbyVenues(names=sw_df['Location'],
latitudes=sw_df['Latitude'],
longitudes=sw_df['Longitude']
)
sw_venues.shape
sw_venues.head(5)
The number of venues returned for each neighbourhood is then explored as follows:
sw_venues.groupby('Neighborhood').count()
Then we check how many unique categories there are among all the returned venues:
print('There are {} uniques categories.'.format(len(sw_venues['Venue Category'].unique())))
sw_venue_unique_count = sw_venues['Venue Category'].value_counts().to_frame(name='Count')
sw_venue_unique_count.head()
3.2 Clustering
address = 'London, United Kingdom'
geolocator = Nominatim(user_agent="ln_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))
map_london = folium.Map(location = [latitude, longitude], zoom_start = 11)
map_london
# Adding markers to map
for lat, lng, borough, loc in zip(sw_df['Latitude'],
                                  sw_df['Longitude'],
                                  sw_df['Borough'],
                                  sw_df['Location']):
    label = '{} - {}'.format(loc, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_london)
display(map_london)
We see four points, which correspond to the four postcodes that we want to analyse.
sw_df
4. Data analysis
In this section, the objective is to check and explore the venues in each neighbourhood.
# one hot encoding
sw_onehot = pd.get_dummies(sw_venues[['Venue Category']], prefix = "", prefix_sep = "")
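As a toy illustration (hypothetical values) of what the one-hot encoding produces:
toy = pd.DataFrame({'Venue Category': ['Café', 'Pub', 'Café']})
pd.get_dummies(toy, prefix='', prefix_sep='')
# one indicator column per category ('Café', 'Pub'), flagging each row's category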
Then the Neighborhood column is added back to the dataframe.
# add neighborhood column back to dataframe
sw_onehot['Neighborhood'] = sw_venues['Neighborhood']
# move neighborhood column to the first column
fixed_columns = [sw_onehot.columns[-1]] + list(sw_onehot.columns[:-1])
sw_onehot = sw_onehot[fixed_columns]
#sw_onehot.head()
# To check the Bakery:
#sw_onehot.loc[sw_onehot['Bakery'] != 0]
Regrouping and category statistics: grouping by neighbourhood and taking the mean of each one-hot column gives the frequency of each venue category per neighbourhood.
sw_grouped = sw_onehot.groupby('Neighborhood').mean().reset_index()
Then we print the 10 most common venues for each neighbourhood:
num_top_venues = 10 # Top common venues needed
for hood in sw_grouped['Neighborhood']:
    print("----" + hood + "----")
    temp = sw_grouped[sw_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
Creating a new dataframe: to put the common venues into a pandas dataframe, the following return_most_common_venues function is used to sort the venues in descending order:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
Then we create a new pandas dataframe with the 10 most common venues as shown below:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind + 1))
# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighborhood'] = sw_grouped['Neighborhood']
for ind in np.arange(sw_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sw_grouped.iloc[ind, :], num_top_venues)
neighbourhoods_venues_sorted.head(5)
Clustering of Neighbourhoods: we create the grouped clustering data for the neighbourhoods as shown below:
sw_grouped_clustering = sw_grouped.drop('Neighborhood', axis=1)  # axis must be named in recent pandas
And then we create clusters of the neighbourhoods using k-means, grouping them into 5 clusters:
# set number of clusters
kclusters = 5
# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state=0).fit(sw_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
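The choice of k = 5 is heuristic. A quick elbow check (an optional sketch, not part of the original analysis) can help justify it:
import matplotlib.pyplot as plt
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=0).fit(sw_grouped_clustering)
    inertias.append(km.inertia_)
plt.plot(range(2, 11), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia (within-cluster sum of squares)')
plt.show()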
Now we create a new dataframe that includes the cluster labels as well as the top 10 venues for each neighbourhood.
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
sw_merged = sw_df
# match/merge SW London data with latitude/longitude for each neighborhood
sw_merged_latlong = sw_merged.join(neighbourhoods_venues_sorted.set_index('Neighborhood'), on = 'Location')
sw_merged_latlong.head(5)
5. Results
To visualize the clusters, we run the following:
sw_clusters=sw_merged_latlong
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sw_clusters['Latitude'], sw_clusters['Longitude'], sw_clusters['Location'], sw_clusters['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
display(map_clusters)
The individual cluster dataframes can be obtained with the following code (it is better to run each cluster individually):
# Cluster 1
cluster1=sw_clusters.loc[sw_clusters['Cluster Labels'] == 0, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
# Cluster 2
cluster2=sw_clusters.loc[sw_clusters['Cluster Labels'] == 1, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
# Cluster 3
cluster3=sw_clusters.loc[sw_clusters['Cluster Labels'] == 2, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
# Cluster 4
cluster4=sw_clusters.loc[sw_clusters['Cluster Labels'] == 3, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
# Cluster 5
cluster5=sw_clusters.loc[sw_clusters['Cluster Labels'] == 4, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
cluster1
cluster2
cluster3
cluster4
cluster5
Conclusion
According to the latest Land Registry figures, Westminster is one of the most expensive boroughs in London in which to buy a property. The average cost of a property in the City of Westminster is slightly less than a million pounds, at £990,896. However, considering having a view of the London Eye or Big Ben while eating breakfast, house prices in this area sound reasonable. Hammersmith and Fulham has an average property cost of £784,613.
So the conclusion is to invest in one of the common venue types listed below:
Hammersmith and Fulham
- Café
- Coffee Shop
- Italian Restaurant
- Pub
- Yoga Studio
- Grocery Store
- Wine Shop
- Park
- French Restaurant
- Climbing Gym
Westminster
- Hotel
- Theater
- Coffee Shop
- Sandwich Place
- Juice Bar
- Sushi Restaurant
- Indian Restaurant
- Movie Theater
- Modern European Restaurant
- Clothing Store
Merton
- Pub
- Coffee Shop
- Sushi Restaurant
- Bar
- Indian Restaurant
- Stationery Store
- Burger Joint
- Grocery Store
- Movie Theater
- Mexican Restaurant
These lists are based on cluster 2.