Python Business Data Analysis: Cohort Analysis

mjmjpp 2024. 6. 10. 08:38

Cohort analysis

import pandas as pd
df = pd.read_excel('C:/pmj/Online Retail.xlsx')
df

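InvoiceNo: Invoice number. A 6-digit integer assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
StockCode: Product code. A 5-digit integer uniquely assigned to each distinct product.
Description: Product name.
Quantity: Quantity of each product per transaction. A negative value (leading '-') indicates a cancellation.
InvoiceDate: Invoice date and time. The date and time at which each transaction was generated.
UnitPrice: Unit price. Numeric; the product price per unit in pounds sterling (British currency).
CustomerID: Customer number. A 5-digit integer uniquely assigned to each customer.
Country: Country name. The name of the country where the customer resides.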
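The two cancellation rules above can be checked against each other. A minimal sketch on a few made-up rows (the `toy` frame and the `is_cancelled` column are hypothetical, not part of the original notebook); the prefix check ignores case because the description says 'c':

```python
import pandas as pd

# Hypothetical toy rows that mimic the column layout described above.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380"],
    "Quantity": [6, -1, 2],
    "UnitPrice": [2.55, 27.50, 1.25],
})

# An invoice number starting with "c"/"C" marks a cancellation.
toy["is_cancelled"] = (
    toy["InvoiceNo"].astype(str).str.upper().str.startswith("C")
)

# Cancelled rows are expected to carry a negative quantity.
print(toy[["InvoiceNo", "Quantity", "is_cancelled"]])
```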

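# Purchase date - first purchase date => tells how many months after the first purchase each purchase was made
# Compute the first purchase date (invoicedatefirst) per customer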
df["invoicedatefirst"]= df.groupby(['CustomerID'])['InvoiceDate'].transform('min')

# Basic descriptive statistics, checked before preprocessing
df.describe()

# Preprocessing: keep only rows with non-negative Quantity and UnitPrice
df_fil = df[(df['Quantity'] >= 0) & (df['UnitPrice'] >= 0)]
df_fil
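The outputs further down print values such as 0.0 and 1.0, which is consistent with some rows having no CustomerID. A minimal sketch, assuming such rows should be excluded because they cannot be assigned to a cohort (the `toy` data is made up, and dropping these rows is an optional extra step, not something the original notebook does):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real file has more columns and rows.
toy = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 13047.0, 17850.0],
    "Quantity": [6, 1, -2, 3],
    "UnitPrice": [2.55, 0.85, 1.25, 4.95],
})

# Same filter as above: keep non-negative quantities and prices.
toy_fil = toy[(toy["Quantity"] >= 0) & (toy["UnitPrice"] >= 0)]

# Assumption: rows without a CustomerID are dropped,
# and the ID is then stored as an integer.
toy_fil = toy_fil.dropna(subset=["CustomerID"]).copy()
toy_fil["CustomerID"] = toy_fil["CustomerID"].astype(int)

print(toy_fil)
```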

# Create the first-purchase-date column
# Work on a copy of the filtered DataFrame
df_copy = df_fil.copy()
df_copy['InvoiceDateFirst'] = df_copy.groupby(['CustomerID'])['InvoiceDate'].transform('min')
df_copy

df_copy[['CustomerID','InvoiceDate','InvoiceDateFirst']].sample(5)

How many months after the first purchase was each purchase made?

# Compute the year difference (year_diff) and the month difference (month_diff)

year_diff= df_copy['InvoiceDate'].dt.year-df_copy['InvoiceDateFirst'].dt.year
month_diff= df_copy['InvoiceDate'].dt.month-df_copy['InvoiceDateFirst'].dt.month

year_diff.value_counts()

0.0    254752
1.0    143172
dtype: int64

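# 'year difference * 12 + month difference + 1' gives the month number of each purchase, counted from the first purchase; store it as cohortindex.
# The data covers December 2010 through December 2011, so cohortindex has a minimum of 1 and a maximum of 13.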
df_copy['cohortindex']=year_diff*12+month_diff+1
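A small worked example of the same formula, on made-up dates (the `toy` frame is hypothetical). A purchase in January 2011 by a customer whose first purchase was in December 2010 gives 1 * 12 + (1 - 12) + 1 = 2:

```python
import pandas as pd

# Hypothetical purchases for two customers.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-05", "2011-01-20", "2011-12-01", "2011-03-15"]
    ),
})

# First purchase date per customer, broadcast back to every row.
toy["InvoiceDateFirst"] = (
    toy.groupby("CustomerID")["InvoiceDate"].transform("min")
)

year_diff = toy["InvoiceDate"].dt.year - toy["InvoiceDateFirst"].dt.year
month_diff = toy["InvoiceDate"].dt.month - toy["InvoiceDateFirst"].dt.month

# Month 1 is the month of the first purchase itself.
toy["cohortindex"] = year_diff * 12 + month_diff + 1
print(toy["cohortindex"].tolist())  # [1, 2, 13, 1]
```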

Monthly frequency by cohort index

# Use the cohortindex values to count purchases for each month since the first purchase.
# value_counts gives the number of purchase rows per month after the first purchase.

df_copy['cohortindex'].value_counts()

1.0     118764
2.0      27868
4.0      27256
3.0      26991
6.0      26964
5.0      25426
7.0      23705
8.0      23604
12.0     23457
10.0     23230
9.0      23011
11.0     20394
13.0      7254
Name: cohortindex, dtype: int64

df_copy

result = df_copy.loc[df_copy['CustomerID'] ==12680.0]
result.describe()

# Visualize the frequency of each cohortindex value with countplot.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,4))
sns.countplot(data=df_copy,x='cohortindex')

Computing retention counts

df_copy['InvoiceDateFirstYM']=df_copy['InvoiceDateFirst'].astype(str).str[:7]
df_copy

# Group by InvoiceDateFirstYM and cohortindex, then count the unique CustomerID values in each group.
#cohort_count
cohort_count=df_copy.groupby(['InvoiceDateFirstYM','cohortindex'])['CustomerID'].nunique().unstack()
cohort_count
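To see what the groupby / nunique / unstack chain produces, here is a minimal sketch on made-up rows (the `toy` and `toy_count` names are hypothetical). Repeat purchases by the same customer in the same month are counted once, and cohort/month combinations with no purchases come out as NaN:

```python
import pandas as pd

# Hypothetical rows: customers 1 and 2 first bought in 2010-12,
# customer 3 first bought in 2011-01.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceDateFirstYM": ["2010-12"] * 5 + ["2011-01"],
    "cohortindex": [1, 1, 2, 1, 1, 1],
})

# Rows: first-purchase month; columns: months since first purchase;
# values: number of distinct customers who bought in that month.
toy_count = (
    toy.groupby(["InvoiceDateFirstYM", "cohortindex"])["CustomerID"]
    .nunique()
    .unstack()
)
print(toy_count)
```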

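Interpreting the table
885 customers made their first purchase in December 2010; 235 of them were still purchasing in month 13.
For the April 2011 cohort, 22 customers remained in month 9.
Column 1 shows the number of newly acquired customers for each month.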

Heatmap of retained purchasers, by first-purchase cohort

# Visualize the retention counts computed above as a heatmap
plt.figure(figsize=(12,8))
sns.heatmap(cohort_count,cmap='Greens',annot=True, fmt=".0f")

New customers acquired per month

Acquisition

#cohort_count
cohort_count[1]

InvoiceDateFirstYM
2010-12    885.0
2011-01    417.0
2011-02    380.0
2011-03    452.0
2011-04    300.0
2011-05    284.0
2011-06    242.0
2011-07    188.0
2011-08    169.0
2011-09    299.0
2011-10    358.0
2011-11    324.0
2011-12     41.0

# Bar chart of new customers acquired per month
cohort_count[1].plot(kind='bar',figsize=(12,3),rot=0,title='Monthly acquisition')

Computing the retention rate

# Dividing each month's count by the count in the first month (column 1) gives the retention rate.
# With div and axis=0, every row is divided by its own first-month value.
#cohort_norm
cohort_norm=cohort_count.div(cohort_count[1],axis=0)
cohort_norm
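A minimal sketch of the row-wise division on a made-up count table (`toy_count` and `toy_norm` are hypothetical names). Each cohort's first month becomes 1.0, and cells that were NaN stay NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical retention counts: rows are cohorts, columns are months.
toy_count = pd.DataFrame(
    {1: [100.0, 50.0], 2: [40.0, 10.0], 3: [25.0, np.nan]},
    index=pd.Index(["2010-12", "2011-01"], name="InvoiceDateFirstYM"),
)

# axis=0 aligns the divisor (column 1) on the row index,
# so each row is divided by its own first-month count.
toy_norm = toy_count.div(toy_count[1], axis=0)
print(toy_norm)
```

For the 2010-12 row this gives 1.0, 0.4, and 0.25, meaning 40% of that cohort bought again in month 2.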

# Visualize the retention rate as a heatmap as well
plt.figure(figsize=(12,8))
sns.heatmap(cohort_norm,cmap='Greens',annot=True)

Monthly revenue retention

df_copy.head(1)
df_copy['totalprice'] = df_copy['Quantity'] * df_copy['UnitPrice']
df_copy

corhort_total_price=df_copy.groupby(['InvoiceDateFirstYM','cohortindex'])['totalprice'].sum().unstack()

# Visualize revenue as a heatmap as well
plt.figure(figsize=(20,6))
sns.heatmap(corhort_total_price,cmap='Greens',annot=True,fmt=',.0f')

# Together these show how many customers remain and how much revenue the remaining customers generate.
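One possible way to connect the two views is to divide the revenue table by the customer-count table, which gives revenue per retained customer. This step is not in the original notebook; the sketch below uses made-up numbers, and `toy_count`, `toy_revenue`, and `toy_avg` are hypothetical names:

```python
import pandas as pd

idx = pd.Index(["2010-12", "2011-01"], name="InvoiceDateFirstYM")

# Hypothetical retained-customer counts and revenue totals,
# laid out the same way (rows: cohort, columns: cohortindex).
toy_count = pd.DataFrame({1: [10.0, 5.0], 2: [4.0, 2.0]}, index=idx)
toy_revenue = pd.DataFrame({1: [500.0, 200.0], 2: [300.0, 50.0]}, index=idx)

# Element-wise division aligns on both row and column labels.
toy_avg = toy_revenue / toy_count
print(toy_avg)
```

In this toy table the 2010-12 cohort spends 50 per customer in month 1 and 75 per customer in month 2: fewer customers remain, but each spends more.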

Reference

https://youtu.be/rj_AgVXkBzQ?feature=shared