Selecting ML Models with Lazypredict (Regression)

Lazypredict is one of Python's AutoML libraries. This post uses it to train many ML models at once and compare their predictive performance.

Lazypredict

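Lazy Predict is an open-source Python project for automating machine learning, developed by an individual, Shankar Rao Pandala, a senior data scientist in India. It currently supports only classification and regression. With Lazypredict, a single line of code loads and trains many ML models and returns their inference results. It also reports performance metrics for each model, which makes it easy to pick out the better-performing ones. One limitation is that it offers no separate facility for tuning hyperparameters.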
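The idea of training several models and ranking them in one pass can be sketched by hand with plain Scikit-Learn. This is only an illustration of the concept, not Lazypredict's implementation; the synthetic data from make_regression and the three candidate models are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A hand-picked handful of candidates; Lazypredict covers far more.
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "R-Squared": r2_score(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
    })

# One row per model, best (lowest RMSE) first.
comparison = pd.DataFrame(rows).set_index("Model").sort_values("RMSE")
print(comparison)
```

Writing this loop for a few models is easy; the appeal of Lazypredict is that it does the same for dozens of models without any of the bookkeeping.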

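Loading the Example Data

The example uses the wine quality dataset. The value to predict is quality, which ranges from 0 to 10. For a detailed description of the dataset, see its page on the UCI Machine Learning Repository. The goal is to build a regression model that predicts a wine's quality score from the other attributes in the data. First, load the data as shown below.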

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, sep=";")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

df.head()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5

df.tail()

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
1594	6.2	0.600	0.08	2.0	0.090	32.0	44.0	0.99490	3.45	0.58	10.5	5
1595	5.9	0.550	0.10	2.2	0.062	39.0	51.0	0.99512	3.52	0.76	11.2	6
1596	6.3	0.510	0.13	2.3	0.076	29.0	40.0	0.99574	3.42	0.75	11.0	6
1597	5.9	0.645	0.12	2.0	0.075	32.0	44.0	0.99547	3.57	0.71	10.2	5
1598	6.0	0.310	0.47	3.6	0.067	18.0	42.0	0.99549	3.39	0.66	11.0	6

Splitting Off the quality Column

y_data = df.pop("quality")
x_data = df

print("Shape of X: ", x_data.shape)
print("Shape of Y: ", y_data.shape)

Shape of X:  (1599, 11)
Shape of Y:  (1599,)
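Note that pop removes the column from df in place, so after this step x_data and df refer to the same 11-column frame. The small sketch below shows that behavior next to a non-mutating alternative; the toy frames and their values are made up for illustration.

```python
import pandas as pd

# Tiny made-up frame; column names mirror the wine data, values are illustrative.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.45],
    "quality": [5, 5, 6],
})

# pop returns the column as a Series and removes it from `toy` in place.
target = toy.pop("quality")
print(list(toy.columns))          # 'quality' is no longer among the columns
print(target.name, target.shape)

# Non-mutating alternative: the original frame keeps its target column.
toy2 = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]})
features = toy2.drop(columns="quality")
labels = toy2["quality"]
```

The in-place version is convenient in a notebook, but rerunning the cell fails with a KeyError because the column is already gone; the drop-based version can be rerun safely.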

Train/Validation Split

from sklearn.model_selection import train_test_split

# Split into training and validation data
x_train, x_test, y_train, y_test = train_test_split(
    x_data, 
    y_data, 
    test_size=0.2, 
    random_state=42)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(1279, 11) (1279,)
(320, 11) (320,)
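The 1279/320 split follows from how train_test_split handles a fractional test_size: the test set size is rounded up, and the rest goes to training. The sketch below reproduces the arithmetic on placeholder arrays of the same length (the arrays themselves are assumptions, not the wine data) and shows that fixing random_state makes the split repeatable.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1599
X = np.arange(n_samples * 2).reshape(n_samples, 2)  # placeholder features
y = np.arange(n_samples)                            # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# 0.2 * 1599 = 319.8, which is rounded up to 320 test rows.
expected_test = math.ceil(0.2 * n_samples)

# The same seed yields the same rows in the test set.
_, _, _, y_te_again = train_test_split(X, y, test_size=0.2, random_state=42)
```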

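Automated Model Training with Lazypredict

Use LazyRegressor to create reg, an instance of Lazypredict's supervised regressor. Pass the training and validation data split above as arguments to reg's fit method. It returns models and predictions. models is a DataFrame of the results from training a range of regression models (mostly from Scikit-Learn), and predictions is a DataFrame that collects each model's predicted values.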

from lazypredict.Supervised import LazyRegressor

reg = LazyRegressor(verbose=0, predictions=True)

models, predictions = reg.fit(x_train, x_test, y_train, y_test)

100%|███████████████████████████████████████████████████████████████████████████████| 41/41 [00:03<00:00, 12.45it/s]

models

	Adjusted R-Squared	R-Squared	RMSE	Time Taken
Model
ExtraTreesRegressor	0.54	0.55	0.54	0.22
RandomForestRegressor	0.52	0.54	0.55	0.40
LGBMRegressor	0.48	0.50	0.57	0.05
HistGradientBoostingRegressor	0.48	0.49	0.58	0.46
XGBRegressor	0.47	0.49	0.58	0.10
BaggingRegressor	0.47	0.49	0.58	0.05
NuSVR	0.45	0.46	0.59	0.06
SVR	0.44	0.46	0.59	0.06
GradientBoostingRegressor	0.43	0.45	0.60	0.15
TransformedTargetRegressor	0.38	0.40	0.62	0.01
LinearRegression	0.38	0.40	0.62	0.01
Lars	0.38	0.40	0.62	0.01
Ridge	0.38	0.40	0.62	0.01
RidgeCV	0.38	0.40	0.62	0.01
BayesianRidge	0.38	0.40	0.62	0.01
SGDRegressor	0.38	0.40	0.63	0.01
LassoLarsIC	0.38	0.40	0.63	0.01
HuberRegressor	0.38	0.40	0.63	0.02
LassoCV	0.38	0.40	0.63	0.05
ElasticNetCV	0.38	0.40	0.63	0.05
LassoLarsCV	0.38	0.40	0.63	0.02
AdaBoostRegressor	0.37	0.39	0.63	0.10
PoissonRegressor	0.37	0.39	0.63	0.01
LarsCV	0.37	0.39	0.63	0.02
OrthogonalMatchingPursuitCV	0.37	0.39	0.63	0.01
LinearSVR	0.36	0.38	0.63	0.01
MLPRegressor	0.33	0.35	0.65	1.08
KNeighborsRegressor	0.31	0.33	0.66	0.02
TweedieRegressor	0.30	0.32	0.67	0.01
GammaRegressor	0.30	0.32	0.67	0.01
OrthogonalMatchingPursuit	0.21	0.24	0.71	0.01
RANSACRegressor	0.07	0.10	0.77	0.05
DecisionTreeRegressor	0.03	0.06	0.78	0.01
ElasticNet	-0.04	-0.01	0.81	0.01
DummyRegressor	-0.04	-0.01	0.81	0.01
Lasso	-0.04	-0.01	0.81	0.01
LassoLars	-0.04	-0.01	0.81	0.01
ExtraTreeRegressor	-0.09	-0.05	0.83	0.01
PassiveAggressiveRegressor	-0.18	-0.14	0.86	0.01
GaussianProcessRegressor	-3.42	-3.27	1.67	0.10
KernelRidge	-50.33	-48.56	5.69	0.05

predictions.head()

	AdaBoostRegressor	BaggingRegressor	BayesianRidge	DecisionTreeRegressor	DummyRegressor	ElasticNet	ElasticNetCV	ExtraTreeRegressor	ExtraTreesRegressor	GammaRegressor	...	RANSACRegressor	RandomForestRegressor	Ridge	RidgeCV	SGDRegressor	SVR	TransformedTargetRegressor	TweedieRegressor	XGBRegressor	LGBMRegressor
0	5.47	5.10	5.34	6.00	5.62	5.62	5.34	6.00	5.30	5.40	...	5.77	5.30	5.35	5.34	5.34	5.44	5.35	5.41	5.32	5.34
1	5.41	5.20	5.07	5.00	5.62	5.62	5.09	5.00	5.19	5.31	...	5.66	5.22	5.06	5.06	5.09	5.09	5.06	5.31	4.81	5.11
2	5.61	5.30	5.65	5.00	5.62	5.62	5.63	5.00	5.41	5.55	...	5.79	5.46	5.66	5.66	5.66	5.60	5.66	5.56	4.78	5.02
3	5.20	5.00	5.46	5.00	5.62	5.62	5.46	7.00	5.31	5.48	...	4.79	5.19	5.46	5.46	5.46	5.33	5.46	5.49	5.20	5.34
4	5.76	6.00	5.73	6.00	5.62	5.62	5.73	6.00	6.00	5.69	...	5.73	6.00	5.73	5.73	5.74	5.71	5.73	5.70	6.00	5.94

5 rows × 41 columns
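Because predictions holds one column per model, per-model metrics can be recomputed with ordinary pandas operations. The sketch below does this for three columns whose values are copied from the five rows shown above; the y_true values are made up for illustration, since the actual validation targets are not printed in this post.

```python
import numpy as np
import pandas as pd

# Three columns copied from the predictions.head() output above.
preds = pd.DataFrame({
    "ExtraTreesRegressor": [5.30, 5.19, 5.41, 5.31, 6.00],
    "BayesianRidge": [5.34, 5.07, 5.65, 5.46, 5.73],
    "DummyRegressor": [5.62, 5.62, 5.62, 5.62, 5.62],
})

# Hypothetical targets (assumption; not the real y_test values).
y_true = pd.Series([5, 5, 6, 5, 6])

# Subtract the target from every column, then take the RMSE column by column.
errors = preds.sub(y_true, axis=0)
per_model_rmse = np.sqrt((errors ** 2).mean())
print(per_model_rmse.sort_values())
```

The same pattern works on the full predictions frame with y_test in place of y_true, provided the indexes are aligned (for example with y_test.reset_index(drop=True)).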

Selecting a Single Model

Picking a model with the smallest RMSE among the reported metrics selects ExtraTreesRegressor.

models.loc[models['RMSE'] == models['RMSE'].min()].index[0]

'ExtraTreesRegressor'
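The same lookup can be written more compactly with idxmin, and nsmallest gives a shortlist rather than a single winner. The leaderboard frame below is a stand-in for models, built from the first three rows of the table above.

```python
import pandas as pd

# Stand-in for `models`, using the first three rows of the table above.
leaderboard = pd.DataFrame(
    {"R-Squared": [0.55, 0.54, 0.50], "RMSE": [0.54, 0.55, 0.57]},
    index=pd.Index(
        ["ExtraTreesRegressor", "RandomForestRegressor", "LGBMRegressor"],
        name="Model",
    ),
)

# Index label of the row with the lowest RMSE.
best_by_rmse = leaderboard["RMSE"].idxmin()

# The two models with the lowest RMSE, best first.
top_two = leaderboard.nsmallest(2, "RMSE").index.tolist()

print(best_by_rmse, top_two)
```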

Let's import that model directly, train it ourselves, and generate predictions for the validation data.

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Train
etr = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(x_train, y_train)

# Predictions
y_pred = etr.predict(x_test)

Evaluating the model's predictive performance on the validation data gives results similar to those obtained with Lazypredict above.

import numpy as np

# R-Squared
r_squared = r2_score(y_test, y_pred)

# Adjusted R-Squared
def adjusted_rsquared(r2, n, p):
    """Penalize R-squared for using p predictors on n samples."""
    return 1 - (1 - r2) * ((n - 1) / (n - p - 1))

adj_rsquared = adjusted_rsquared(
    r_squared, x_test.shape[0], x_test.shape[1]
)

# RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"""
R-Squared: {r_squared:.2f}
Adjusted R-Squared: {adj_rsquared:.2f}
RMSE: {rmse:.2f}
""")

R-Squared: 0.55
Adjusted R-Squared: 0.54
RMSE: 0.54
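For reference, the adjusted_rsquared function above implements the usual adjusted R-squared formula. Plugging in the validation set dimensions used here (n = 320 rows, p = 11 features) shows why the adjusted value sits only slightly below the plain one:

```latex
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},
\qquad
\frac{n - 1}{n - p - 1} = \frac{319}{308} \approx 1.036
```

With so many more rows than features, the penalty inflates the unexplained share 1 - R^2 by only about 3.6%, which is a difference of one or two units in the second decimal place for the values reported above.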

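Comparing Results After Standardization

I expected that standardizing the data before running Lazypredict would affect the training results and tested it, but the results turned out to be the same as above. The reason is that Lazypredict already applies StandardScaler in its internal source code. Beyond this, feature engineering or hyperparameter tuning could be applied to improve model performance.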

from sklearn.preprocessing import StandardScaler

# Standardize using statistics from the training data
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.shape, y_train.shape)
print(x_test_scaled.shape, y_test.shape)

(1279, 11) (1279,)
(320, 11) (320,)

The output below shows that the scaler has stored the mean and standard deviation of each column of x_train.

pd.DataFrame({"scaler_mean": scaler.mean_, "scaler_std": scaler.scale_})

	scaler_mean	scaler_std
0	8.32	1.72
1	0.53	0.18
2	0.27	0.20
3	2.56	1.44
4	0.09	0.05
5	15.88	10.31
6	46.66	32.93
7	1.00	0.00
8	3.31	0.15
9	0.66	0.17
10	10.42	1.05

x_train.describe().T[['mean','std']]

	mean	std
fixed acidity	8.32	1.72
volatile acidity	0.53	0.18
citric acid	0.27	0.20
residual sugar	2.56	1.44
chlorides	0.09	0.05
free sulfur dioxide	15.88	10.31
total sulfur dioxide	46.66	32.94
density	1.00	0.00
pH	3.31	0.15
sulphates	0.66	0.17
alcohol	10.42	1.05
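The two tables agree except for a small gap in total sulfur dioxide (32.93 versus 32.94). A likely explanation is that StandardScaler stores the population standard deviation (divisor n), while pandas describe reports the sample standard deviation (divisor n - 1). The sketch below illustrates the difference on a made-up column of the same length as x_train.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up column with 1279 rows, the size of the training set above.
col = pd.DataFrame({"x": rng.normal(loc=46.0, scale=33.0, size=1279)})

scaler_std = StandardScaler().fit(col).scale_[0]   # population std (ddof=0)
pandas_std = col["x"].std()                        # sample std (ddof=1)
numpy_pop_std = col["x"].to_numpy().std(ddof=0)

# The two estimates differ by a factor of sqrt(n / (n - 1)).
ratio = pandas_std / scaler_std
print(scaler_std, pandas_std, ratio)
```

With n = 1279 the factor is about 1.0004, which is enough to move a value near 32.93 across a rounding boundary but too small to show up in the other columns at two decimals.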

from lazypredict.Supervised import LazyRegressor

reg = LazyRegressor(verbose=0, predictions=True)

models, predictions = reg.fit(x_train_scaled, x_test_scaled, y_train, y_test)

100%|███████████████████████████████████████████████████████████████████████████████| 41/41 [00:03<00:00, 12.51it/s]

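Lazypredict standardizes the already-standardized data once more internally. Data that already has zero mean and unit variance is left unchanged by a second standardization, so the metrics are identical to those obtained with the unstandardized data; only the Time Taken column varies slightly.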

models

	Adjusted R-Squared	R-Squared	RMSE	Time Taken
Model
ExtraTreesRegressor	0.54	0.55	0.54	0.22
RandomForestRegressor	0.52	0.54	0.55	0.41
LGBMRegressor	0.48	0.50	0.57	0.04
HistGradientBoostingRegressor	0.48	0.49	0.58	0.47
XGBRegressor	0.47	0.49	0.58	0.10
BaggingRegressor	0.47	0.49	0.58	0.05
NuSVR	0.45	0.46	0.59	0.06
SVR	0.44	0.46	0.59	0.06
GradientBoostingRegressor	0.43	0.45	0.60	0.15
TransformedTargetRegressor	0.38	0.40	0.62	0.01
LinearRegression	0.38	0.40	0.62	0.00
Lars	0.38	0.40	0.62	0.01
Ridge	0.38	0.40	0.62	0.01
RidgeCV	0.38	0.40	0.62	0.01
BayesianRidge	0.38	0.40	0.62	0.01
SGDRegressor	0.38	0.40	0.63	0.01
LassoLarsIC	0.38	0.40	0.63	0.01
HuberRegressor	0.38	0.40	0.63	0.02
LassoCV	0.38	0.40	0.63	0.05
ElasticNetCV	0.38	0.40	0.63	0.05
LassoLarsCV	0.38	0.40	0.63	0.02
AdaBoostRegressor	0.37	0.39	0.63	0.10
PoissonRegressor	0.37	0.39	0.63	0.01
LarsCV	0.37	0.39	0.63	0.02
OrthogonalMatchingPursuitCV	0.37	0.39	0.63	0.01
LinearSVR	0.36	0.38	0.63	0.02
MLPRegressor	0.33	0.35	0.65	1.09
KNeighborsRegressor	0.31	0.33	0.66	0.02
TweedieRegressor	0.30	0.32	0.67	0.01
GammaRegressor	0.30	0.32	0.67	0.01
OrthogonalMatchingPursuit	0.21	0.24	0.71	0.00
RANSACRegressor	0.07	0.10	0.77	0.05
DecisionTreeRegressor	0.03	0.06	0.78	0.01
ElasticNet	-0.04	-0.01	0.81	0.01
DummyRegressor	-0.04	-0.01	0.81	0.00
Lasso	-0.04	-0.01	0.81	0.01
LassoLars	-0.04	-0.01	0.81	0.00
ExtraTreeRegressor	-0.09	-0.05	0.83	0.01
PassiveAggressiveRegressor	-0.18	-0.14	0.86	0.01
GaussianProcessRegressor	-3.42	-3.27	1.67	0.10
KernelRidge	-50.33	-48.56	5.69	0.06
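The claim that a second standardization changes nothing can be checked directly. The sketch below uses made-up features and a hypothetical 160/40 split; it fits one scaler on the training part, fits a second scaler on the already-scaled training data, and compares the results.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up features on very different scales (assumption for illustration).
X = rng.normal(loc=[8.0, 0.5, 46.0], scale=[1.7, 0.2, 33.0], size=(200, 3))
X_train, X_test = X[:160], X[160:]

# First pass: what the user does before calling Lazypredict.
first = StandardScaler().fit(X_train)
train_once = first.transform(X_train)
test_once = first.transform(X_test)

# Second pass: what happens again inside the library's pipeline.
second = StandardScaler().fit(train_once)
train_twice = second.transform(train_once)
test_twice = second.transform(test_once)

# Largest change introduced by the second pass (floating-point noise only).
max_diff = np.abs(test_twice - test_once).max()
print(max_diff)
```

The second scaler learns a mean of about 0 and a scale of about 1, so its transform is effectively the identity on both the training and the validation data.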

Using Lazypredict After Applying PCA

Let's reduce the dimensionality of the standardized data with PCA and then run Lazypredict.

from sklearn.decomposition import PCA

pca = PCA(n_components=9)
pca.fit(x_train_scaled)  # fit on the same standardized data that is transformed below

x_train_ = pca.transform(x_train_scaled)
x_test_ = pca.transform(x_test_scaled)
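The choice of 9 components is not explained in this post. One common way to pick the number is to look at the cumulative explained variance ratio, as sketched below. The data here is synthetic (11 features driven by 4 latent factors), and the 95% threshold is an arbitrary assumption, so the resulting count says nothing about the wine data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: 11 features generated from 4 latent factors plus noise.
latent = rng.normal(size=(500, 4))
X = latent @ rng.normal(size=(4, 11)) + 0.3 * rng.normal(size=(500, 11))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect how variance is distributed.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep, cumulative.round(3))
```

Running the same inspection on x_train_scaled would show how much variance the two discarded components carry.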

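In the run shown below, the ExtraTreesRegressor metrics are slightly better than in the results above: R-Squared rises from 0.55 to 0.57 and RMSE falls from 0.54 to 0.53.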

from lazypredict.Supervised import LazyRegressor

reg = LazyRegressor(verbose=0, predictions=True)
models, predictions = reg.fit(x_train_, x_test_, y_train, y_test)
models

100%|███████████████████████████████████████████████████████████████████████████████| 41/41 [00:03<00:00, 12.48it/s]

	Adjusted R-Squared	R-Squared	RMSE	Time Taken
Model
ExtraTreesRegressor	0.56	0.57	0.53	0.21
RandomForestRegressor	0.56	0.57	0.53	0.43
LGBMRegressor	0.52	0.54	0.55	0.04
HistGradientBoostingRegressor	0.52	0.54	0.55	0.42
BaggingRegressor	0.52	0.53	0.55	0.05
GradientBoostingRegressor	0.49	0.51	0.57	0.22
XGBRegressor	0.48	0.50	0.57	0.11
SVR	0.45	0.46	0.59	0.05
NuSVR	0.44	0.46	0.59	0.06
AdaBoostRegressor	0.39	0.40	0.62	0.11
KNeighborsRegressor	0.38	0.40	0.63	0.02
Lars	0.38	0.40	0.63	0.01
TransformedTargetRegressor	0.38	0.40	0.63	0.01
LinearRegression	0.38	0.40	0.63	0.00
Ridge	0.38	0.40	0.63	0.01
RidgeCV	0.38	0.40	0.63	0.01
BayesianRidge	0.38	0.40	0.63	0.01
LassoLarsIC	0.38	0.40	0.63	0.01
LassoCV	0.38	0.39	0.63	0.04
ElasticNetCV	0.38	0.39	0.63	0.05
SGDRegressor	0.38	0.39	0.63	0.01
LassoLarsCV	0.38	0.39	0.63	0.01
LarsCV	0.38	0.39	0.63	0.02
HuberRegressor	0.37	0.39	0.63	0.02
PoissonRegressor	0.37	0.38	0.63	0.01
OrthogonalMatchingPursuitCV	0.36	0.38	0.64	0.01
LinearSVR	0.35	0.37	0.64	0.01
MLPRegressor	0.35	0.37	0.64	1.08
TweedieRegressor	0.28	0.30	0.68	0.01
GammaRegressor	0.28	0.30	0.68	0.01
RANSACRegressor	0.24	0.27	0.69	0.04
OrthogonalMatchingPursuit	0.24	0.26	0.69	0.01
DecisionTreeRegressor	0.19	0.21	0.72	0.01
ExtraTreeRegressor	0.12	0.15	0.75	0.01
LassoLars	-0.03	-0.01	0.81	0.01
DummyRegressor	-0.03	-0.01	0.81	0.00
ElasticNet	-0.03	-0.01	0.81	0.01
Lasso	-0.03	-0.01	0.81	0.00
PassiveAggressiveRegressor	-0.61	-0.57	1.01	0.00
GaussianProcessRegressor	-2.42	-2.33	1.47	0.10
KernelRidge	-50.04	-48.60	5.69	0.05

This brief look at Lazypredict's regression functionality suggests that the library is better suited to quickly comparing the performance of many ML models than to automatically finding the optimal one.

References
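Since Lazypredict does not tune hyperparameters, a natural follow-up is to tune the shortlisted model separately. The sketch below uses Scikit-Learn's GridSearchCV with ExtraTreesRegressor; the synthetic data, the small parameter grid, and the 3-fold setting are all assumptions chosen to keep the example fast, not recommended values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

# Deliberately tiny grid; a real search would cover more values.
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a regular RMSE
print(best_params, best_rmse)
```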

lazypredict GitHub
