London's finance industry is based in the City of London and Canary Wharf, the two major business districts of the city, and London itself is one of the world's preeminent financial centres and among the most important locations for international finance. In this work we discuss which venues are the most promising to invest in across London: we present an analysis made in Python and show, step by step, how to determine the top 10 venues for the 3 best boroughs, based on demographic data and unemployment rate statistics.
The content of this work has five sections:
- Getting the data
- Data wrangling
- Methodology
- Data analysis
- Results and Discussion
London is one of the most attractive cities in the world to do business in. It is the capital of both England and the UK. In this notebook we show how to study data about the UK in order to understand how and where to invest money in a small business.
# library for BeautifulSoup, for web scraping
from bs4 import BeautifulSoup
# library to handle data in a vectorized manner
import numpy as np
# library for data analysis
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# library to handle JSON files
import json
print('numpy, pandas, ..., imported...')
!pip -q install geopy
print('geopy installed...')
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
print('Nominatim imported...')
# library to handle requests
import requests
print('requests imported...')
# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize
print('json_normalize imported...')
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
print('matplotlib imported...')
# import k-means from clustering stage
from sklearn.cluster import KMeans
print('Kmeans imported...')
# install the Geocoder
!pip -q install geocoder
import geocoder
# import time
import time
!pip -q install folium
print('folium installed...')
import folium # map rendering library
print('folium imported...')
from pandas import ExcelWriter
from pandas import ExcelFile
print('...Done')
import warnings
warnings.filterwarnings('ignore')
1.2 Selecting the data
The unemployment ratio is the proportion of the working-age population that is unemployed. The unemployment ratio by borough in London is shown in the following figure, which is produced by the following commands:
unemployment_ratio_df = pd.read_csv("unemployment_ratio.csv")
unemployment_ratio_df2=unemployment_ratio_df.dropna()
unemployment_ratio_df2.rename({'Unnamed: 1': '2001', 'Unnamed: 2': '2011'}, axis=1, inplace=True)
unemployment_ratio_df2.reset_index(drop=True, inplace=True)
unemployment_ratio_df2['2011'] = unemployment_ratio_df2['2011'].str[:-1].astype(float)
unemployment_ratio_df2['2001'] = unemployment_ratio_df2['2001'].str[:-1].astype(float)
unemployment_ratio_df3=unemployment_ratio_df2.sort_values(by='2011', ascending=True)
unemployment_ratio_df3.plot(x="Unemployment ratio by borough", y=["2001", "2011"], kind="bar")
This graph shows that the ratio has come down significantly in almost all London boroughs in a relatively short timescale.
Considering the lowest unemployment ratios by borough, we have the following table:
unemployment_ratio_df3.head()
The top 3 boroughs with the lowest unemployment ratio in 2011 are:
- Hammersmith and Fulham
- Wandsworth
- Kingston upon Thames
Now we compare this with the change in unemployment ratio from 2011-13 to 2014-16:
change_unemployment_ratio_df = pd.read_csv("change_unemployment_ratio.csv")
change_unemployment_ratio_df=change_unemployment_ratio_df.dropna()
change_unemployment_ratio_df.head()
change_unemployment_ratio_df.rename({'Change in unemployment ratio 2011-13 to 2014-16': 'Borough', 'Unnamed: 1': 'Percentual'}, axis=1, inplace=True)
change_unemployment_ratio_df['Percentual'] = change_unemployment_ratio_df['Percentual'].str[:-1].astype(float)
change_unemployment_ratio_df2=change_unemployment_ratio_df.sort_values(by='Percentual', ascending=True)
change_unemployment_ratio_df2.plot(kind='bar',x='Borough',y='Percentual')
change_unemployment_ratio_df2.head(10)
And the largest changes in unemployment ratio are for the boroughs:
- Newham
- Croydon
- Ealing
But this fact alone does not guarantee that a borough will be a good place to invest. So, to be more confident about which borough is the best to invest in, we consider the average percentage change, which can be a good indication of which borough to choose.
change_unemployment_ratio_df2["Percentual"].mean()
Therefore, a borough whose change is around the mean of -1.98% should be stable:
- Hammersmith and Fulham has -2.200%
- Haringey -2.134%
- Waltham Forest -2.036%
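These candidates can also be selected programmatically. A minimal sketch, assuming the dataframe and column names defined above (the expected output is inferred from the figures quoted, not verified here):
mean_change = change_unemployment_ratio_df2['Percentual'].mean()  # about -1.98
# boroughs at or below the mean, ordered from closest to the mean downwards
near_mean = change_unemployment_ratio_df2[
    change_unemployment_ratio_df2['Percentual'] <= mean_change
].sort_values('Percentual', ascending=False)
near_mean.head(3)  # expected: Waltham Forest, Haringey, Hammersmith and Fulham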
We have these three candidates, but since Hammersmith and Fulham also has the lowest unemployment ratio, it is a good candidate to invest in. The borough is located in the South West and North West of London.
Thus we are going to explore which small businesses show a trend across the South West of London.
London is one of the strongest regions and cities on the planet. It is highly attractive and offers some of the best prospects for business development and innovative thinking, which makes it central to any European or global investment strategy.
2. Data wrangling
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another format, with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
In this section we are going to extract more data about London from different sources and transform it into dataframes in order to perform the analytics.
2.1 Exploration of the city of London
In order to transform data from websites into data which we can process, we require a great tool called BeautifulSoup. We load the libraries and get the List of areas of London.
# library for BeautifulSoup
from bs4 import BeautifulSoup
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_areas_of_London'
wikipedia_page = requests.get(wikipedia_link)
# Cleans html file
soup = BeautifulSoup(wikipedia_page.content, 'html.parser')
# This extracts the "tbody" within the table where class is "wikitable sortable"
table = soup.find('table', {'class':'wikitable sortable'}).tbody
# Extracts all "tr" (table rows) within the table above
rows = table.find_all('tr')
# Extracts the column headers, removes and replaces possible '\n' with space for the "th" tag
columns = [i.text.replace('\n', '')
           for i in rows[0].find_all('th')]
# Converts columns to pd dataframe
df = pd.DataFrame(columns = columns)
'''
Extracts every row with its corresponding columns, then appends the values to the
pd dataframe "df" created above. The first row (rows[0]) is skipped because it is the header.
'''
for i in range(1, len(rows)):
    tds = rows[i].find_all('td')
    # strip newlines and non-breaking spaces from every cell
    values = [td.text.replace('\n', '').replace('\xa0', '') for td in tds]
    # note: DataFrame.append is deprecated in recent pandas; pd.concat is the modern equivalent
    df = df.append(pd.Series(values, index=columns), ignore_index=True)
df
df.head(5)
df.columns = ['Location', 'Borough', 'Post-town', 'Postcode',
'Dial-code', 'OSgridref']
df.columns
df.head()
We have to perform some cleaning with the following command:
# Remove Borough reference numbers with []
df['Borough'] = df['Borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
df.head()
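To see why the chained rstrip calls work, here is a quick check on a hypothetical cell value (the bracketed number stands for a Wikipedia reference mark):
sample = 'Hammersmith and Fulham[7]'
sample.rstrip(']').rstrip('0123456789').rstrip('[')  # -> 'Hammersmith and Fulham'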
We are now going to make some assumptions to reduce the amount of data that we want to process.
Cells containing several postcodes are split so that each postcode gets its own row, with the values of the other columns repeated.
df0 = df.drop('Postcode', axis=1).join(df['Postcode'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('Postcode'))
df0.head()
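The chained split/stack above is dense, so here is a toy illustration (hypothetical values) of what it does:
toy = pd.DataFrame({'Location': ['A'], 'Postcode': ['SW6, SW10']})
toy.drop('Postcode', axis=1).join(
    toy['Postcode'].str.split(',', expand=True)
       .stack().reset_index(level=1, drop=True).rename('Postcode'))
# index 0 is duplicated: one row with 'SW6' and one with ' SW10'
# (note the leading space, which is removed later with str.strip())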
From the data, only the ‘Location’, ‘Borough’, ‘Postcode’, ‘Post-town’ will be used.
df1 = df0[['Location', 'Borough', 'Postcode', 'Post-town']].reset_index(drop=True)
df1.head()
Now, only the Boroughs with London Post-town will be used for our search of location. Therefore, all the non-post-town are dropped.
df2 = df1.copy() # work on a copy of df1
df21 = df2[df2['Post-town'].str.contains('LONDON')]
df21.shape
df3 = df21[['Location', 'Borough', 'Postcode']].reset_index(drop=True)
df3.head()
From the study of the boroughs, Hammersmith and Fulham lies in the North West and South West areas of London. For this project the South West will be considered for our analysis. The south-west areas have postcodes starting with SW.
df_london = df3.copy()  # use a copy so df3 is left unchanged
# Strips whitespace around each postcode
df_london.Postcode = df_london.Postcode.str.strip()
# New dataframe for South West London postcodes - df_sw
df_sw = df_london[df_london['Postcode'].str.startswith(('SW'))].reset_index(drop=True)
df_sw.head(10)
We are interested in the demography of London, specifically the proportion of White residents by borough.
demograph_link = 'https://en.wikipedia.org/wiki/Demography_of_London'
demograph_page = requests.get(demograph_link)
soup1 = BeautifulSoup(demograph_page.content, 'html.parser')
table1 = soup1.find('table', {'class':'wikitable sortable'}).tbody
rows1 = table1.find_all('tr')
columns1 = [i.text.replace('\n', '')
            for i in rows1[0].find_all('th')]
demo_london = pd.DataFrame(columns = columns1)
for j in range(1, len(rows1)):
    tds1 = rows1[j].find_all('td')
    # strip newlines and non-breaking spaces from every cell
    values1 = [td1.text.replace('\n', '').replace('\xa0', '') for td1 in tds1]
    demo_london = demo_london.append(pd.Series(values1, index=columns1), ignore_index=True)
demo_london
#converting string to float
demo_london['White'] = demo_london['White'].astype('float')
demo_london_sorted = demo_london.sort_values(by='White', ascending = False)
demo_london_sorted.head(10)
demo_london_sorted["White"].mean()
Considering the top 6 boroughs whose share of White residents is above the average of 61.58%, we have the following list:
- Hammersmith and Fulham 68.1%
- Camden 66.3%
- Merton 64.9%
- Barnet 64.1%
- Greenwich 62.5%
- Westminster 61.7%
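The same shortlist can be reproduced programmatically; a minimal sketch using the dataframe built above:
avg_white = demo_london_sorted['White'].mean()  # about 61.58
above_avg = demo_london_sorted[demo_london_sorted['White'] > avg_white]
above_avg.head(6)  # expected to match the six boroughs listed above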
df_sw_top = df_sw[df_sw['Borough'].isin(['Hammersmith and Fulham','Camden', 'Merton', 'Barnet','Greenwich', 'Westminster'])].reset_index(drop=True)
df_sw_top
df_sw_top.shape
# Geocoder starts here
# Defining a function to use --> get_latlng()
def get_latlng(arcgis_geocoder):
    # Initialize the location (lat. and long.) to "None"
    lat_lng_coords = None
    # Loop until the location has been geocoded
    # (note: this retries indefinitely if the geocoder keeps failing)
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, London, United Kingdom'.format(arcgis_geocoder))
        lat_lng_coords = g.latlng
    return lat_lng_coords
# Geocoder ends here
sample = get_latlng('SW6')
sample
ga = geocoder.geocodefarm(sample, method = 'reverse')
ga
start = time.time()
postal_codes = df_sw_top['Postcode']
coordinates = [get_latlng(postal_code) for postal_code in postal_codes.tolist()]
end = time.time()
print("Time of execution: ", end - start, "seconds")
Then we proceed to store the location data (latitude and longitude) as follows. The obtained coordinates are then joined to df_sw_top to create a new data frame.
df_sw_loc = df_sw_top.copy()  # use a copy so df_sw_top is left unchanged
# The obtained coordinates (latitude and longitude) are joined with the dataframe as shown
df_sw_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])
df_sw_loc['Latitude'] = df_sw_coordinates['Latitude']
df_sw_loc['Longitude'] = df_sw_coordinates['Longitude']
df_sw_loc.head(5)
df_sw_loc.shape
In order to use Foursquare we need credentials, which are saved in the file credential.json:
import json
filename = 'credential.json'
with open(filename) as f:
    data = json.load(f)
CLIENT_ID = data['CLIENT_ID'] # Foursquare (FS) ID
CLIENT_SECRET = data['CLIENT_SECRET'] # FS Secret
VERSION = data['VERSION'] # FS API version
#LIMIT = 30
#print('Your credentails:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)
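If the file does not exist yet, it can be created with a one-off helper like the following (placeholder values, not real credentials; the VERSION date is only an example of the Foursquare API's YYYYMMDD version format):
creds = {
    'CLIENT_ID': 'YOUR_FOURSQUARE_CLIENT_ID',
    'CLIENT_SECRET': 'YOUR_FOURSQUARE_CLIENT_SECRET',
    'VERSION': '20180605'  # placeholder API version date
}
with open('credential.json', 'w') as f:
    json.dump(creds, f)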
3. Methodology
With the Foursquare API, we are going to obtain a data set of venues around specific locations, which will be stored in a data frame; with that it is possible to do the analytics.
First, we will explore and collect data for a single neighbourhood; later we apply the same procedure to multiple neighbourhoods, storing the results in a data frame.
The following step is to cluster the venues by location and identify the most common venues in each cluster.
3.1 Data Exploration
Single Neighbourhood
An initial exploration of a single neighbourhood within the London area was done to get familiar with the Foursquare API.
# Resets the current index to a new
sw_df = df_sw_loc.reset_index().drop('index', axis = 1)
sw_df.loc[sw_df['Location'] == 'Fulham']
Let's use Fulham, at index location 2, as shown below:
Fulham_lat = sw_df.loc[2, 'Latitude']
Fulham_long = sw_df.loc[2, 'Longitude']
Fulham_loc = sw_df.loc[2, 'Location']
Fulham_postcode = sw_df.loc[2, 'Postcode']
print('The latitude and longitude values of {} with postcode {}, are {}, {}.'.format(Fulham_loc, Fulham_postcode, Fulham_lat, Fulham_long))
Let's explore the top 50 venues that are within a 1500-metre radius of Fulham. We then create the GET request URL and store it in url.
# Credentials are provided already for this part
LIMIT = 50 # limit of number of venues returned by Foursquare API
radius = 1500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
Fulham_lat,
Fulham_long,
radius,
LIMIT)
# displays URL
url
results = requests.get(url).json()
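The part of the response we care about has the following nested shape (abridged, inferred from the keys accessed in the code below):
# results['response']['groups'][0]['items'] is a list of dicts like:
# {
#     'venue': {
#         'name': ...,
#         'location': {'lat': ..., 'lng': ...},
#         'categories': [{'name': ...}, ...]
#     },
#     ...
# }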
From the results, the necessary information needs to be extracted from the items key. To do this, the get_category_type function below is used:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
The result is then cleaned up from json to a structured pandas dataframe as shown below:
import numpy as np
import pandas as pd
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues) # flatten JSON
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head(10)
nearby_venues_Fulham_unique = nearby_venues['categories'].value_counts().to_frame(name='Count')
The most common venues in Fulham are:
nearby_venues_Fulham_unique.head(5)
So in Fulham the top nearby venues are café, pizza place, pub, grocery store and fish & chips shop.
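For a quick visual check, the counts can also be plotted (an optional sketch using the dataframe above):
nearby_venues_Fulham_unique.head(5).plot(kind='bar', legend=False)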
Multiple Neighbourhoods
In order to repeat the previous procedure for multiple neighbourhoods, we define a function:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
and we use this function on our data frame sw_df as follows:
sw_venues = getNearbyVenues(names=sw_df['Location'],
latitudes=sw_df['Latitude'],
longitudes=sw_df['Longitude']
)
sw_venues.shape
sw_venues.head(5)
The number of venues returned for each neighbourhood is then explored as follows:
sw_venues.groupby('Neighborhood').count()
Then we check how many unique categories there are among all the returned venues:
print('There are {} uniques categories.'.format(len(sw_venues['Venue Category'].unique())))
sw_venue_unique_count = sw_venues['Venue Category'].value_counts().to_frame(name='Count')
sw_venue_unique_count.head()
3.2 Clustering
address = 'London, United Kingdom'
geolocator = Nominatim(user_agent="ln_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))
map_london = folium.Map(location = [latitude, longitude], zoom_start = 11)
map_london
# Adding markers to map
for lat, lng, borough, loc in zip(sw_df['Latitude'],
                                  sw_df['Longitude'],
                                  sw_df['Borough'],
                                  sw_df['Location']):
    label = '{} - {}'.format(loc, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_london)
display(map_london)
We see four points, which correspond to the four postcodes that we want to analyse.
sw_df
4. Data analysis
In this section, the objective is to check and explore the venues in each neighbourhood.
# one hot encoding
sw_onehot = pd.get_dummies(sw_venues[['Venue Category']], prefix = "", prefix_sep = "")
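As a toy illustration (hypothetical values) of what the one-hot encoding produces:
toy = pd.DataFrame({'Venue Category': ['Café', 'Pub', 'Café']})
pd.get_dummies(toy, prefix='', prefix_sep='')
# one indicator column per category ('Café', 'Pub'), flagging each row's category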
Then the Neighborhood column is added back to the dataframe.
# add neighborhood column back to dataframe
sw_onehot['Neighborhood'] = sw_venues['Neighborhood']
# move neighborhood column to the first column
fixed_columns = [sw_onehot.columns[-1]] + list(sw_onehot.columns[:-1])
sw_onehot = sw_onehot[fixed_columns]
#sw_onehot.head()
# To check the Bakery:
#sw_onehot.loc[sw_onehot['Bakery'] != 0]
Regrouping and category statistics: grouping by neighbourhood and taking the mean of each one-hot column gives the frequency of each venue category per neighbourhood.
sw_grouped = sw_onehot.groupby('Neighborhood').mean().reset_index()
Then we print the 10 most common venues for each neighbourhood:
num_top_venues = 10 # Top common venues needed
for hood in sw_grouped['Neighborhood']:
    print("----" + hood + "----")
    temp = sw_grouped[sw_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
Creating a new dataframe: to put the common venues into a pandas dataframe, the following return_most_common_venues function is used to sort the venues in descending order:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
Then we create a new pandas dataframe with the 10 most common venues as shown below:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind + 1))
# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighborhood'] = sw_grouped['Neighborhood']
for ind in np.arange(sw_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sw_grouped.iloc[ind, :], num_top_venues)
neighbourhoods_venues_sorted.head(5)
Clustering of Neighbourhoods: we create the grouped clustering data for the neighbourhoods as shown below:
sw_grouped_clustering = sw_grouped.drop('Neighborhood', axis=1)  # axis must be named in recent pandas
And then we create clusters of the neighbourhoods using k-means, grouping them into 5 clusters:
# set number of clusters
kclusters = 5
# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state=0).fit(sw_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
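The choice of k = 5 is heuristic. A quick elbow check (an optional sketch, not part of the original analysis) can help justify it:
import matplotlib.pyplot as plt
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=0).fit(sw_grouped_clustering)
    inertias.append(km.inertia_)
plt.plot(range(2, 11), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia (within-cluster sum of squares)')
plt.show()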
Now we create a new dataframe that includes the cluster labels as well as the top 10 venues for each neighbourhood.
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
sw_merged = sw_df
# match/merge SW London data with latitude/longitude for each neighborhood
sw_merged_latlong = sw_merged.join(neighbourhoods_venues_sorted.set_index('Neighborhood'), on = 'Location')
sw_merged_latlong.head(5)
5. Results
To visualize the clusters, we run the following:
sw_clusters=sw_merged_latlong
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sw_clusters['Latitude'], sw_clusters['Longitude'], sw_clusters['Location'], sw_clusters['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
display(map_clusters)
The individual cluster dataframes can be obtained with the following code (it is better to run each cluster individually):
# Cluster 1
cluster1=sw_clusters.loc[sw_clusters['Cluster Labels'] == 0, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
# Cluster 2
cluster2=sw_clusters.loc[sw_clusters['Cluster Labels'] == 1, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
# Cluster 3
cluster3=sw_clusters.loc[sw_clusters['Cluster Labels'] == 2, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
# Cluster 4
cluster4=sw_clusters.loc[sw_clusters['Cluster Labels'] == 3, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
# Cluster 5
cluster5=sw_clusters.loc[sw_clusters['Cluster Labels'] == 4, sw_clusters.columns[[1] + list(range(5, sw_clusters.shape[1]))]]
cluster1
cluster2
cluster3
cluster4
cluster5
Conclusion
According to the latest Land Registry figures, Westminster is one of the most expensive boroughs in London in which to buy a property. The average cost of a property in the City of Westminster is slightly less than a million pounds, at £990,896. However, considering having a view of the London Eye or Big Ben while eating breakfast, house prices in this area sound reasonable. Hammersmith and Fulham has an average property cost of £784,613.
So the conclusion is to invest in one of the common venue types listed below:
Hammersmith and Fulham
- Café
- Coffee Shop
- Italian Restaurant
- Pub
- Yoga Studio
- Grocery Store
- Wine Shop
- Park
- French Restaurant
- Climbing Gym
Westminster
- Hotel
- Theater
- Coffee Shop
- Sandwich Place
- Juice Bar
- Sushi Restaurant
- Indian Restaurant
- Movie Theater
- Modern European Restaurant
- Clothing Store
Merton
- Pub
- Coffee Shop
- Sushi Restaurant
- Bar
- Indian Restaurant
- Stationery Store
- Burger Joint
- Grocery Store
- Movie Theater
- Mexican Restaurant
These lists are based on cluster 2.