
How To Clean Data So That The Correct Arrival Code Is There For The City Pair?

From the picture, the CSV looks like this: column 1 is the City Pair (Departure - Arrival), column 2 is meant to be

Solution 1:

As an example, using the cleaner-looking data from your other question (the original screenshot is not available here; the df constructed in the code below appears to stand in for it), try:

import pandas as pd
from math import sin, cos, sqrt, atan2, radians

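def get_distance(in_lat1, in_lon1, in_lat2, in_lon2):
    # great-circle (haversine) distance in km between two points
    # given as latitude/longitude in decimal degrees
    # approximate radius of earth in km
    R = 6373.0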

    lat1 = radians(in_lat1)
    lon1 = radians(in_lon1)
    lat2 = radians(in_lat2)
    lon2 = radians(in_lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance

df = pd.DataFrame({'Normalised City Pair': {0: 'London, United Kingdom - New York, United States',
  1: 'Johannesburg, South Africa - London, United Kingdom',
  2: 'London, United Kingdom - New York, United States',
  3: 'Johannesburg, South Africa - London, United Kingdom',
  4: 'London, United Kingdom - Singapore, Singapore'},
 'Departure Code': {0: 'LHR', 1: 'JNB', 2: 'LHR', 3: 'JNB', 4: 'SIN'},
 'Arrival Code': {0: 'JFK', 1: 'LHR', 2: 'JFK', 3: 'LHR', 4: 'LHR'},
 'Departure_lat': {0: 51.5, 1: -26.1, 2: 51.5, 3: -26.1, 4: 1.3},
 'Departure_lon': {0: -0.45, 1: 28.23, 2: -0.45, 3: 28.23, 4: 103.98},
 'Arrival_lat': {0: 40.64, 1: 51.47, 2: 40.64, 3: 51.47, 4: 51.47},
 'Arrival_lon': {0: -73.79, 1: -0.45, 2: -73.79, 3: -0.45, 4: -0.45}})

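# airport reference data from OurAirports: keep only the name and IATA code,
# and drop rows that have no IATA code
df_airports = pd.read_csv('https://ourairports.com/data/airports.csv')
df_airports = df_airports[['name', 'iata_code']].copy()
df_airports = df_airports[df_airports['iata_code'].notna()].reset_index(drop=True)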
# df_airports.query('iata_code == "CDG" | iata_code == "LHR"')

df['Distance'] = df.apply(lambda x: get_distance(x['Departure_lat'], x['Departure_lon'], x['Arrival_lat'], x['Arrival_lon']), axis=1)

#df[['ap_dep', 'ap_arr']] = df['Normalised City Pair'].str.split(' - ', expand=True)

# keep one row per IATA code so the lookup index is unique
df_airports = df_airports.sort_values('name')
df_airports = df_airports.drop_duplicates(subset='iata_code', keep='first')

# look up the airport name for each departure and arrival code
iata_to_name = df_airports.set_index('iata_code')['name']
df['dep_ap_name'] = df['Departure Code'].map(iata_to_name)
df['arr_ap_name'] = df['Arrival Code'].map(iata_to_name)

Output:

[Image not available: the resulting df, with the added Distance, dep_ap_name and arr_ap_name columns]
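The airport names make mismatches visible by eye. One possible way to flag and repair them programmatically is sketched below. This is not part of the original answer: the code_to_city lookup is a hypothetical hand-written dict (in practice it might be built from the airports file, assuming it carries a city column), and the rule "swap the codes back when both ends are reversed" is an assumption about what "clean" means here.

```python
import pandas as pd

# A few rows copied from the sample data above.
df = pd.DataFrame({
    'Normalised City Pair': [
        'London, United Kingdom - New York, United States',
        'Johannesburg, South Africa - London, United Kingdom',
        'London, United Kingdom - Singapore, Singapore',
    ],
    'Departure Code': ['LHR', 'JNB', 'SIN'],
    'Arrival Code': ['JFK', 'LHR', 'LHR'],
})

# Hypothetical lookup, written by hand for this illustration only.
code_to_city = {'LHR': 'London', 'JFK': 'New York',
                'JNB': 'Johannesburg', 'SIN': 'Singapore'}

# Split "City, Country - City, Country" and keep just the two city names.
pair = df['Normalised City Pair'].str.split(' - ', expand=True)
dep_city = pair[0].str.split(',').str[0].str.strip()
arr_city = pair[1].str.split(',').str[0].str.strip()

# A row counts as swapped when each code belongs to the opposite end of the pair.
swapped = ((df['Departure Code'].map(code_to_city) == arr_city)
           & (df['Arrival Code'].map(code_to_city) == dep_city))

# Swap the two codes back on those rows only; to_numpy() avoids column alignment.
cols = ['Departure Code', 'Arrival Code']
df.loc[swapped, cols] = df.loc[swapped, cols[::-1]].to_numpy()

# Re-check: True where both codes now agree with the city pair.
df['codes_match_pair'] = ((df['Departure Code'].map(code_to_city) == dep_city)
                          & (df['Arrival Code'].map(code_to_city) == arr_city))
print(df)
```

With the sample rows, only the 'London, United Kingdom - Singapore, Singapore' row (codes SIN / LHR) is flagged as swapped. Rows where a code matches neither city would be left untouched and show False in codes_match_pair, so they can be reviewed by hand.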

If the frame now has too many columns and you want something cleaner, you can select and reorder columns with final_df = df[['a', 'b', 'c', 'd']], where 'a', 'b', 'c' and 'd' are the column names in the order you would like.
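As a minimal sketch of that selection, using column names from the sample data (the small frame here is only a stand-in so the snippet runs on its own):

```python
import pandas as pd

# Stand-in frame with one row copied from the sample data.
df = pd.DataFrame({
    'Normalised City Pair': ['London, United Kingdom - New York, United States'],
    'Departure Code': ['LHR'],
    'Arrival Code': ['JFK'],
    'Departure_lat': [51.5],
    'Departure_lon': [-0.45],
    'Arrival_lat': [40.64],
    'Arrival_lon': [-73.79],
})

# Keep three columns, in the order listed; .copy() makes final_df independent of df.
final_df = df[['Departure Code', 'Arrival Code', 'Normalised City Pair']].copy()
print(final_df.columns.tolist())
```

The list passed inside the outer brackets controls both which columns are kept and their order; a name that is not a column of df raises a KeyError.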
