Car Price Visualization & Modeling

mjmjpp 2024. 3. 17. 07:00

Let's predict price from car information!

# Predicting price with AI

Teaching the AI the patterns in the data

The process by which the AI finds patterns in the data!

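0. Understanding the data comes first -> always learn what each variable means before starting

1. Let's find which features matter for predicting price and which do not.

- Does the manufacturer (make) matter?? => Imported and domestic cars differ in price

- Does the number of doors matter -> does it affect price?? => A two-door car is likely to be a supercar.. so it may well affect price

- Does the height of the car affect price?? => Probably not.. SUVs are taller than sedans.. but does that make them more expensive?? Hard to judge from this alone..

- Someone who knows cars well! -> can make a good guess from background knowledge alone. Domain expertise is needed; an expert's insight matters

----> Domain expertise matters when building AI.. it is important to develop that insight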

2. Visualize the data before building the AI!

# Load the data

# Car price prediction
import pandas as pd
df = pd.read_csv('C:/Users/Automobile price.csv')
df

import matplotlib.pyplot as plt
plt.scatter(df['length'],df['price'] )

plt.title('length,price')
plt.grid(True)
plt.ylabel('price')
plt.xlabel('length')
plt.show()

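The plot shows how length relates to price

3. Which columns have the most influence on price prediction?

*Check the correlation coefficients

Correlation coefficient: measures the linear relationship between two variables. The Pearson correlation coefficient is the usual choice and takes values between -1 and 1. The closer to 0, the weaker the linear relationship; the closer to 1 or -1, the stronger.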
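To make the definition concrete, here is a minimal sketch that computes the Pearson coefficient by hand. The helper name `pearson_r` and the toy numbers are made up for illustration and are not taken from the car dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variance numerators)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, chosen only to show the sign and size of r
engine_size = [90, 110, 130, 150, 200]
price = [6000, 8000, 10000, 12000, 17000]  # exactly linear in engine_size
mpg = [40, 35, 30, 26, 18]                 # falls as price rises

print(pearson_r(engine_size, price))  # close to 1: strong positive relationship
print(pearson_r(mpg, price))          # close to -1: strong negative relationship
```

A value near 1 or -1 only says the points lie close to a straight line; it says nothing about how steep that line is.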

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

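# Build a DataFrame containing the selected columns
selected_columns = ['symboling','wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size','bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price']
df_selected = df[selected_columns]

# Compute the correlation matrix; numeric_only=True skips the text columns
# ('engine-type', 'num-of-cylinders'), which cannot be correlated directly
correlation_matrix = df_selected.corr(numeric_only=True)

# Draw the heatmap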
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Selected Columns')
plt.show()

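*Extract the correlation between each numeric variable and price

# Compute the correlation matrix (numeric columns only)
corr_matrix = df.corr(numeric_only=True)
# Rank the columns by the absolute value of their correlation with 'price'
high_corr_features = corr_matrix['price'].abs().sort_values(ascending=False)
print(high_corr_features)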

price                1.000000
engine-size          0.872335
curb-weight          0.834415
horsepower           0.810533
width                0.751265
highway-mpg          0.704692
length               0.690628
city-mpg             0.686571
wheel-base           0.584642
bore                 0.543436
normalized-losses    0.203254
height               0.135486
peak-rpm             0.101649
symboling            0.082391
stroke               0.082310
compression-ratio    0.071107
Name: price, dtype: float64

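-> The higher a column appears in the list, the stronger its correlation with price!

-> An absolute value of around 0.8 can be read as a strong correlation.. Because the code takes abs(), the list shows strength only, not direction.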
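Since the ranking uses absolute values, the sign of each correlation is hidden. The sketch below, on a made-up toy DataFrame (the values are hypothetical, only the column names echo the dataset), shows one way to rank by strength while still keeping the sign.

```python
import pandas as pd

# Hypothetical toy data, not rows from the real dataset
toy = pd.DataFrame({
    "engine-size": [90, 110, 130, 150, 200],
    "highway-mpg": [40, 35, 30, 26, 18],
    "price": [6000, 8000, 10000, 12000, 17000],
})

# Signed correlation of every column with price (price itself removed)
signed = toy.corr()["price"].drop("price")

# Order by absolute strength, but keep the signed values
ranked = signed.reindex(signed.abs().sort_values(ascending=False).index)
print(ranked)
```

In this toy data highway-mpg ranks high on strength but carries a negative sign, meaning price tends to fall as mileage rises.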

*Plot the three variables with the highest correlation and the three with the lowest, each against price

import matplotlib.pyplot as plt

# Create a 1x3 grid of plots
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Draw each scatter plot
axs[0].scatter(df['engine-size'], df['price'])
axs[0].set_xlabel('Engine Size')
axs[0].set_ylabel('Price')
axs[0].set_title('Price vs Engine Size')

axs[1].scatter(df['curb-weight'], df['price'])
axs[1].set_xlabel('Curb Weight')
axs[1].set_ylabel('Price')
axs[1].set_title('Price vs Curb Weight')

# The third-highest correlation with price is horsepower
axs[2].scatter(df['horsepower'], df['price'])
axs[2].set_xlabel('Horsepower')
axs[2].set_ylabel('Price')
axs[2].set_title('Price vs Horsepower')

# Adjust the layout
plt.tight_layout()
plt.show()

import matplotlib.pyplot as plt

# Create a 1x3 grid of plots
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Draw each scatter plot
axs[0].scatter(df['symboling'], df['price'])
axs[0].set_xlabel('Symboling')
axs[0].set_ylabel('Price')
axs[0].set_title('Price vs Symboling')

axs[1].scatter(df['stroke'], df['price'])
axs[1].set_xlabel('Stroke')
axs[1].set_ylabel('Price')
axs[1].set_title('Price vs Stroke')

axs[2].scatter(df['compression-ratio'], df['price'])
axs[2].set_xlabel('Compression Ratio')
axs[2].set_ylabel('Price')
axs[2].set_title('Price vs Compression Ratio')

# Adjust the layout
plt.tight_layout()
plt.show()

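Conclusion -> The top three variables show a visible relationship with price, so they are important for price prediction

but the bottom three show no clear relationship, so they matter little for price prediction...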

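Car Price Modeling

AI development process
1. Data cleaning
- Remove missing values
- Convert text data to numeric data
2. Model development
3. Model evaluation

1. Data cleaning

- Remove missing values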

# Count the missing values in each column
missing_values = df.isnull().sum()

# Print the missing-value counts
print("Missing values per column:")
print(missing_values)

Missing values per column:
symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-ratio     0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

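# Drop every row that has a missing value in any column
df_cleaned = df.dropna()

# Count the missing values in each column again
missing_values = df_cleaned.isnull().sum()

# Print the missing-value counts
print("Missing values per column:")
print(missing_values)

Missing values per column: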
symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64
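Calling dropna() with no arguments removes a row if any column is missing, so a column with many gaps (normalized-losses has 41 here) can cost a lot of rows. As a rough sketch on a made-up toy DataFrame, restricting the check with `subset` keeps more data; which columns to protect is an assumption for illustration, not the choice made in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with gaps in different columns
toy = pd.DataFrame({
    "normalized-losses": [np.nan, 164.0, np.nan, 158.0],
    "horsepower": [111.0, 102.0, 115.0, np.nan],
    "price": [13495.0, 13950.0, np.nan, 17710.0],
})

# Drop a row if ANY column is missing
all_dropped = toy.dropna()

# Drop a row only if one of the listed columns is missing
needed_only = toy.dropna(subset=["horsepower", "price"])

print(len(toy), len(all_dropped), len(needed_only))
```

Comparing the row counts before and after is a quick way to see how much data a cleaning step costs.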

- Convert the text columns to numbers

1) Print the data type of every column

# Check the data type of each column
data_types = df.dtypes
print(data_types)

symboling              int64
normalized-losses    float64
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

2) Encode the text columns, choosing between label encoding and one-hot encoding

Label Encoding:
- What it does: assigns a unique integer to each category value. For example, encoding gender as 0 for male and 1 for female.
- When to use: mainly when the categories have a natural order. For example, encoding education level as 0 for elementary school, 1 for middle school, and 2 for high school.
One-Hot Encoding:
- What it does: creates a new binary feature for each category value. Each category gets its own column, holding 1 where the row belongs to it and 0 elsewhere.
- When to use: mainly when the categories have no order and are independent of one another, such as city or country names.
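The two encodings can be compared side by side on a small example. This is only a sketch: the toy rows and the `door_order` mapping are made up for illustration, and the post itself one-hot encodes every text column.

```python
import pandas as pd

# Hypothetical toy rows using two of the dataset's column names
toy = pd.DataFrame({
    "num-of-doors": ["two", "four", "four", "two"],
    "body-style": ["sedan", "wagon", "hatchback", "sedan"],
})

# Label-style encoding: an explicit mapping keeps the meaningful order
door_order = {"two": 2, "four": 4}
toy["num-of-doors_label"] = toy["num-of-doors"].map(door_order)

# One-hot encoding: one new column per body style, no order implied
one_hot = pd.get_dummies(toy, columns=["body-style"])

print(toy["num-of-doors_label"].tolist())
print(one_hot.columns.tolist())
```

The door count has a real numeric order, so a single integer column is enough, while body style has none, so each style gets its own column.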

# Convert the text columns to numbers

# Apply one-hot encoding with get_dummies
encoded_df = pd.get_dummies(
    df_cleaned,
    columns=['make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style',
             'drive-wheels', 'engine-location', 'engine-type',
             'num-of-cylinders', 'fuel-system'])
encoded_df.head()

# One-hot encoding increases the number of columns

encoded_df.columns

Index(['symboling', 'normalized-losses', 'wheel-base', 'length', 'width',
       'height', 'curb-weight', 'engine-size', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price', 'make_audi', 'make_bmw', 'make_chevrolet',
       'make_dodge', 'make_honda', 'make_jaguar', 'make_mazda',
       'make_mercedes-benz', 'make_mitsubishi', 'make_nissan', 'make_peugot',
       'make_plymouth', 'make_porsche', 'make_saab', 'make_subaru',
       'make_toyota', 'make_volkswagen', 'make_volvo', 'fuel-type_diesel',
       'fuel-type_gas', 'aspiration_std', 'aspiration_turbo',
       'num-of-doors_four', 'num-of-doors_two', 'body-style_convertible',
       'body-style_hardtop', 'body-style_hatchback', 'body-style_sedan',
       'body-style_wagon', 'drive-wheels_4wd', 'drive-wheels_fwd',
       'drive-wheels_rwd', 'engine-location_front', 'engine-type_dohc',
       'engine-type_l', 'engine-type_ohc', 'engine-type_ohcf',
       'engine-type_ohcv', 'num-of-cylinders_eight', 'num-of-cylinders_five',
       'num-of-cylinders_four', 'num-of-cylinders_six',
       'num-of-cylinders_three', 'fuel-system_1bbl', 'fuel-system_2bbl',
       'fuel-system_idi', 'fuel-system_mfi', 'fuel-system_mpfi',
       'fuel-system_spdi'],
      dtype='object')

data_types =encoded_df.dtypes
print(data_types)

symboling              int64
normalized-losses    float64
wheel-base           float64
length               float64
width                float64
                      ...   
fuel-system_2bbl       uint8
fuel-system_idi        uint8
fuel-system_mfi        uint8
fuel-system_mpfi       uint8
fuel-system_spdi       uint8
Length: 65, dtype: object

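2. Model development

- Build a model that predicts price by applying linear regression to the encoded data.

-> What are we predicting??

-> Be explicit about which model is used and what it predicts!

Create a Linear Regression model, train it on the training data, then predict on the test data
Model performance is evaluated with mean squared error (MSE) and the coefficient of determination (R-squared)

MSE: the average of the squared differences between predicted and actual values -> lower is better
R-squared: how well the model explains the data -> closer to 1 is better
MAE (mean_absolute_error): the average absolute difference between predicted and actual values, a performance metric for regression models
-> The smaller the MAE, the better the model predicts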
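The three metrics above can be computed by hand to see what each one measures. A minimal sketch in plain Python follows; the function names and the four toy prices are made up for illustration, and in practice the sklearn.metrics functions do the same job.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted prices
y_true = [10000, 12000, 15000, 20000]
y_pred = [11000, 11000, 16000, 18000]

print("MSE :", mse(y_true, y_pred))
print("RMSE:", math.sqrt(mse(y_true, y_pred)))  # back in the same unit as price
print("MAE :", mae(y_true, y_pred))
print("R2  :", r_squared(y_true, y_pred))
```

MSE is in squared price units, which is why it looks so large next to MAE; taking its square root gives a number that can be read in dollars.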

*Here price is predicted with Linear Regression, but other models can be tried and their performance compared
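One way such a comparison might look is sketched below. It runs on synthetic data from make_regression rather than the car dataset, and RandomForestRegressor is just one possible alternative chosen for illustration, so the scores say nothing about how these models would rank on the real data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: replaces encoded_df for this sketch)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit every model on the same split and score it on the same test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(name, results[name])
```

Keeping the split and the metrics identical across models is what makes the numbers comparable.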

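from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Separate the independent variables from the dependent variable
X = encoded_df.drop(columns=['price'])  # every column except the target
y = encoded_df['price']  # the target to predict

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Inspect the fitted regression equation (intercept and coefficients)
model.intercept_
model.coef_


# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R Squared:", r2)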

Mean Squared Error: 3315323.6693646433
R Squared: 0.8136896546524797

3. Model evaluation

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
mae=mean_absolute_error(y_test,y_pred)
r2=r2_score(y_test,y_pred)


print("Mean Squared Error:", mse)
print("R Squared:",r2)
print("mean_absolute_error",mae)

Mean Squared Error: 3315323.6693646433
R Squared: 0.8136896546524797
mean_absolute_error 1414.5569315890496

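mean_absolute_error: the average absolute difference between predicted and actual values, a performance metric for regression models

-> The smaller the MAE, the better the model predicts

mean_absolute_error 1414.5569315890496

=> The predicted price is off from the actual price by roughly 1,400 dollars on average!

-> In Korean won, that is a prediction gap of about 2 million won

-> Not too bad... many cars cost hundreds of millions of won, so a gap of about 2 million won seems acceptable..

-> Lower error is better, but how much error is acceptable depends on where and why the model is used.

1) You need a very precise value -> a 2 million won gap is large

2) You only want to understand the trend -> a 2 million won gap is not large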
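Whether an error of about 1,400 dollars is large also depends on the price of the car being predicted. A rough sketch of that idea follows; the helper `relative_error` and the three price points are hypothetical, and only the MAE value comes from the output above.

```python
# MAE reported above, rounded to cents
mae_usd = 1414.56

def relative_error(mae, typical_price):
    """MAE expressed as a fraction of a given price level."""
    return mae / typical_price

# Hypothetical price levels, chosen only to show how the ratio changes
for typical_price in (7000, 15000, 40000):
    share = relative_error(mae_usd, typical_price)
    print(f"price {typical_price}: error is about {share:.1%} of the price")
```

The same absolute error is a much bigger share of a cheap car's price than of an expensive one's, which is worth keeping in mind when deciding whether the model is precise enough.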