EDU_Day5(kor)

Smallghost 2024. 2. 28. 03:00

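Today we started hands-on data analysis practice. I tried to work through it with a new dataset and it did not go well, which threw me a little, but I am writing it down anyway.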

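To look at the relationship between offenders' family circumstances and the recidivism rate, I went to the public data portal and downloaded two datasets: '경찰청_범죄자 생활정도, 혼인관계 및 부모관계_12_31_2020' (National Police Agency: offenders' standard of living, marital status, and parental status) and '경찰청_범죄자 범행시 전과 및 재범여부_12_31_2020' (National Police Agency: offenders' prior convictions and recidivism at the time of the offense).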

0. Environment setup

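# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # built in inside Jupyter; explicit import for plain scripts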

1. Data loading: crime-related data

Variable	Description	Role
Ex-conviction (subordinate)	Prior convictions (subtotal)	Target
Level of living (downstream)	Standard of living (low)	Feature
Quality of living (middle class)	Standard of living (middle)	Feature
Standard of living (upper)	Standard of living (high)	Feature
Quality of living (unknown)	Standard of living (unknown)	Feature
Major category of Crime	Major crime category	Feature
Crime classification	Crime subcategory	Feature
Unknown	Unknown	Feature

# Data loading and description

# 경찰청_범죄자 생활정도, 혼인관계 및 부모관계_12_31_2020 (source URL: https://www.data.go.kr/tcs/dss/selectDataSetList.do?keyword=%EB%B2%94%EC%A3%84%EC%9E%90&conditionType=search&org=&orgFilter=&orgFullName=)

path = 'National_Police_Agency_Criminal_living_standards,_marital_relationships,_and parental_relationships_12_31_2020.csv'
temp1 = pd.read_csv(path)
temp1.head()

	Major category of Crime	Crime classification	Standard of living (in the world)	Level of living (downstream)	Quality of living (middle class)	Standard of living (upper)	Quality of living (unknown)	Marital relationship (in line)	Marriage relationship (subordinate)	Marital relationship (spouse)	...	Single parent relationship (real (positive) parent)	Single parent relationship (step-parent)	Single parent relationship (parental mother)	Single parent relationship (parent mother)	Single parent relationship (mother-in-law)	Parental relationship of unmarried persons (mother-in-law)	Parental relationship of unmarried persons (Stepmother-in-law)	Single parent relationship (non-parent)	Unknown
0	Brutal Crime	Murder	341	172	72	6	91	341	158	81	...	45	0	2	8	0	17	0	20	91
1	Brutal Crime	Attempted Murder	454	276	93	3	82	454	234	123	...	82	1	0	10	1	19	1	24	82
2	Brutal Crime	Theft	1202	713	327	15	147	1202	257	144	...	615	2	8	37	8	69	3	56	147
3	Brutal Crime	Rape	6113	1831	1987	78	2217	6113	1295	798	...	1998	8	23	143	19	270	2	139	2216
4	Brutal Crime	Pseudo-Rape	934	331	362	7	234	934	236	145	...	352	1	4	21	3	55	1	29	232

5 rows × 24 columns

# 경찰청_범죄자 범행시 전과 및 재범여부_12_31_2020 (source URL: https://www.data.go.kr/tcs/dss/selectDataSetList.do?keyword=%EB%B2%94%EC%A3%84%EC%9E%90&conditionType=search&org=&orgFilter=&orgFullName=)

path = 'National_Police_Agency_whether_criminal_offense_is_committed_and_recidivism_12_31_2020.csv'
temp2 = pd.read_csv(path)

temp2.head()

	Major category of Crime	Crime classification	None	Ex-conviction (subordinate)	Ex-conviction (1 criminal)	Ex-conviction (2 criminal)	Ex-conviction (3 criminal)	Ex-conviction (4 criminal)	Ex-conviction (5 criminal)	Ex-conviction (6 criminal)	Ex-conviction (7 criminal)	Ex-conviction (8 criminal)	Ex-conviction (9 or more)	Unknown
0	Brutal Crime	Murder	87	163	33	19	9	20	11	9	8	3	51	91
1	Brutal Crime	Attempted Murder	127	245	45	29	19	23	17	15	13	10	74	82
2	Brutal Crime	Theft	310	745	126	93	73	55	39	29	32	22	276	147
3	Brutal Crime	Rape	1573	2361	599	374	244	189	149	111	101	64	530	2179
4	Brutal Crime	Pseudo-Rape	331	379	101	59	42	34	25	17	17	6	78	224

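Since this is only a provisional practice analysis, for now I will look at how the prior-conviction subtotal varies with offenders' standard of living.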

# Reshape the data
temp1_add = temp1[['Major category of Crime','Crime classification',
                   'Level of living (downstream)','Quality of living (middle class)',
                   'Standard of living (upper)', 'Quality of living (unknown)']]
temp2_add = temp2['Ex-conviction (subordinate)']
data = pd.concat([temp1_add, temp2_add], axis = 1, join = 'inner')

data.to_csv('C:/Users/User/Python/P2/temp.csv')  # Save to CSV to inspect the data
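One thing to watch in the step above: pd.concat with axis=1 lines rows up by index, so it only gives the right result if both files list the crime categories in exactly the same order. A minimal sketch of a safer alternative is to merge on the two category columns instead. The toy frames below reuse the first three rows from the head() outputs (with temp2 deliberately shuffled); the assumption that the category pair is unique in both files is mine and is checked by validate.

```python
import pandas as pd

# Toy stand-ins for temp1 / temp2: first three rows from the head() outputs above
temp1 = pd.DataFrame({
    'Major category of Crime': ['Brutal Crime', 'Brutal Crime', 'Brutal Crime'],
    'Crime classification': ['Murder', 'Attempted Murder', 'Theft'],
    'Level of living (downstream)': [172, 276, 713],
})
temp2 = pd.DataFrame({
    'Major category of Crime': ['Brutal Crime', 'Brutal Crime', 'Brutal Crime'],
    'Crime classification': ['Theft', 'Murder', 'Attempted Murder'],  # shuffled on purpose
    'Ex-conviction (subordinate)': [745, 163, 245],
})

keys = ['Major category of Crime', 'Crime classification']

# Join on the category columns rather than on row position.
# validate='one_to_one' raises an error if a key pair is duplicated on either side.
merged = temp1.merge(temp2, on=keys, how='inner', validate='one_to_one')
print(merged)
```

Comparing len(merged) with len(temp1) afterwards is a quick way to see whether any category failed to match.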

Univariate analysis: numeric variables

Create a function to analyze numeric variables

def eda_num(data, var, bins=38):

    # Summary statistics
    print('<< Summary statistics >>')
    display(data[[var]].describe().T)
    print('='*100)

    # Visualization: histogram with KDE on top, box plot below
    print('<< Plots >>')
    plt.figure(figsize=(10,6))

    plt.subplot(2,1,1)
    sns.histplot(data[var], bins = bins, kde = True)
    plt.grid()

    plt.subplot(2,1,2)
    sns.boxplot(x = data[var])
    plt.grid()
    plt.show()

(1) Ex-conviction (subordinate) - Prior convictions (subtotal) | Target | (unit: cases)

var = 'Ex-conviction (subordinate)'

Check summary statistics and distribution

eda_num(data, var)

<< Summary statistics >>

	count	mean	std	min	25%	50%	75%	max
Ex-conviction (subordinate)	38.0	17260.526316	33549.166399	19.0	353.5	2169.0	12859.75	143557.0

====================================================================================================
<< Plots >>

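Findings from the summary statistics and distribution

Per crime category, the prior-conviction subtotal for 2020 ranges from a minimum of 19 to a maximum of 143,557.
Is there a reason certain types of crime have especially high counts?
Most categories stay below 20,000 (the 75th percentile is 12,859.75, while the mean of about 17,260 is pulled up by a few very large categories). -> Is there a reason for this?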
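The gap between mean and median noted above is the signature of a right-skewed distribution, and a log transform is one common way to make such a histogram readable. Here is a rough sketch of that check. It only uses the first five 'Ex-conviction (subordinate)' values visible in the temp2.head() output, so the numbers it prints are illustrative and not the statistics of the full 38-row column.

```python
import numpy as np
import pandas as pd

# First five values from the temp2.head() output above (illustrative subset only)
s = pd.Series([163, 245, 745, 2361, 379], name='Ex-conviction (subordinate)')

# A mean well above the median points to a long right tail
print('mean  :', s.mean())
print('median:', s.median())
print('skew  :', s.skew())

# log1p compresses the large values so the bulk of the distribution becomes visible
log_s = np.log1p(s)
print(log_s.round(2).tolist())
```

On the real data, the same idea would be to add np.log1p(data[var]) as a new column and pass that column name to eda_num, or to call sns.histplot with log_scale=True.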

Further analysis?

Is there anything related worth following up on?

(2) Level of living (downstream) - Standard of living (low) | Feature | (unit: persons)

var = 'Level of living (downstream)'

eda_num(data, var)

<< Summary statistics >>

	count	mean	std	min	25%	50%	75%	max
Level of living (downstream)	38.0	15193.394737	29548.062724	9.0	289.75	1778.5	11313.75	127215.0

====================================================================================================
<< Plots >>

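Findings from the summary statistics and distribution

Per crime category, the number of offenders with a low standard of living ranges from a minimum of 9 to a maximum of 127,215.

Further analysis?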
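One possible follow-up: raw counts mostly reflect how large each crime category is, so a share may be easier to compare across categories than a headcount. The sketch below computes the low-standard-of-living share per row from the four living-standard columns. It uses only the first three rows of the temp1.head() output, and the column names total and low_share are hypothetical helpers, not part of the dataset.

```python
import pandas as pd

living_cols = ['Level of living (downstream)', 'Quality of living (middle class)',
               'Standard of living (upper)', 'Quality of living (unknown)']

# First three rows of the temp1.head() output above (illustrative subset only)
df = pd.DataFrame({
    'Crime classification': ['Murder', 'Attempted Murder', 'Theft'],
    'Level of living (downstream)': [172, 276, 713],
    'Quality of living (middle class)': [72, 93, 327],
    'Standard of living (upper)': [6, 3, 15],
    'Quality of living (unknown)': [91, 82, 147],
})

# Row total across the four living-standard columns
df['total'] = df[living_cols].sum(axis=1)

# Share of offenders in the low group (hypothetical helper column)
df['low_share'] = df['Level of living (downstream)'] / df['total']
print(df[['Crime classification', 'total', 'low_share']])
```

Whether the unknown group should stay in the denominator is a judgment call; dropping it would give the share among offenders whose standard of living was recorded.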

Univariate analysis: categorical variables

Create the function

def eda_cat(data, var):
    t1 = data[var].value_counts()                   # absolute frequency per category
    t2 = data[var].value_counts(normalize = True)   # relative frequency per category
    t3 = pd.concat([t1, t2], axis = 1)
    t3.columns = ['count','ratio']
    display(t3)
    sns.countplot(x = var, data = data)
    plt.xticks(rotation = 90)                       # rotate long category labels
    plt.show()

(3) Major category of Crime - Major crime category | Feature |

var = 'Major category of Crime'

Check summary statistics and distribution

eda_cat(data, var)

	count	ratio
Major category of Crime
Intelligent Crime	9	0.236842
Brutal Crime	8	0.210526
Violent Crime	8	0.210526
Moral Crime	2	0.052632
Theft	1	0.026316
Special Economic Crime	1	0.026316
Drug Crime	1	0.026316
Health Crime	1	0.026316
Environmental Crime	1	0.026316
Traffic Crime	1	0.026316
Labor Crime	1	0.026316
Security Crime	1	0.026316
Election Crime	1	0.026316
Military Crime	1	0.026316
Other Crimes	1	0.026316

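Findings from the summary statistics and distribution

The three categories with the largest shares, Brutal Crime, Violent Crime, and Intelligent Crime, account for about 21%, 21%, and 24% of the rows respectively. Note that these are shares of the 38 subcategory rows, not of offender counts.
Could these three have been the hottest crime trends in 2020? -> If so, why did they increase? Row counts only show how finely each major category is subdivided, so answering this needs the counts summed per major category.

Further analysis?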
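As a possible next step, the value_counts table above only counts subcategory rows, so a grouped sum would show which major categories actually carry the most prior convictions. A minimal sketch follows. The three Brutal Crime values come from the temp2.head() output, while the 'Placeholder A' row and its value are made up purely so the example has a second group.

```python
import pandas as pd

df = pd.DataFrame({
    'Major category of Crime': ['Brutal Crime', 'Brutal Crime', 'Brutal Crime', 'Placeholder A'],
    # First three values are from temp2.head(); the last one is a made-up placeholder
    'Ex-conviction (subordinate)': [163, 245, 745, 5000],
})

# Per major category: number of subcategory rows and the summed count
summary = (df.groupby('Major category of Crime')['Ex-conviction (subordinate)']
             .agg(rows='count', total='sum')
             .sort_values('total', ascending=False))

# Share of the overall total, comparable to the 'ratio' column from eda_cat
summary['total_ratio'] = summary['total'] / summary['total'].sum()
print(summary)
```

In this toy input the category with a single row ranks first by total, which is exactly the situation row counts alone would hide.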