Building a machine learning model from data

Disclaimer

Training an effective machine learning model is hard work. Although I have studied fairly diligently, I have not yet managed to build a complete machine learning model. This post does not cover applications that support training and model experiment tracking, because although most of them can be used for free, they come with certain limits. Using such tools saves a lot of time, but in my view it is like an addiction: you gradually become dependent on them. This post was written purely as a course report, not as a guide or a required procedure. Use it according to your actual needs, and research and decide for yourself which method is best for your own project.

Setting up a local environment

Installing Python and the notebook

Install Python by following the guide: https://www.python.org/downloads/
Install venv (if your Python version does not already include it)
Install the Jupyter notebook kernel
Create a venv in the project directory: setting venv

Installing Jupyter Notebook

pip3 install jupyter ipykernel
python3 -m ipykernel install --user --name=ml-demo --display-name "ml-demo"

In the notebook interface, at the top right, select the kernel you just created.

Installing the MLflow experiment tracking tool

For convenience, we use Docker.

docker-compose.yml
services:
  mlflow-server:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "5000:5000"
    volumes:
      - mlflow-data:/mlflow


volumes:
  mlflow-data:

Dockerfile
FROM python:3.13
WORKDIR /mlflow
RUN pip3 install mlflow psutil pynvml
ENV BACKEND_URI=sqlite:///mlflow.db
ENV MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true
ENV MLFLOW_DEPLOYMENT_SERVER_START_TIMEOUT=600
ENV MLFLOW_GATEWAY_SEARCH_ROUTES_PAGE_SIZE=10
EXPOSE 5000
CMD mlflow ui --backend-store-uri $BACKEND_URI --host 0.0.0.0 --port 5000

Start the container, then open http://localhost:5000/ to reach the web interface.
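
Before pointing a notebook at the server, it can help to confirm that it is reachable. The snippet below is only a sketch: the helper name `is_server_up` is hypothetical, and it assumes the web interface answers plain HTTP GET requests at the address given above.

```python
import urllib.error
import urllib.request


def is_server_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


if __name__ == "__main__":
    # Address taken from the port mapping in docker-compose.yml
    print(is_server_up("http://localhost:5000/"))
```

If this prints False, the container is probably still starting or the port mapping differs from the one assumed here.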

Dataset

For this post, I have prepared a simulated dataset. Download and use it here.

Training the model

See the full notebook file here

Processing the data

Read the data from the CSV file and check some information about it

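import pandas as pd

df = pd.read_csv('data/train_.csv')
df.info()            # column names, dtypes and non-null counts
df.shape             # (number of rows, number of columns)
df.isnull().sum()    # number of null values per column
df.head()            # first five rows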

The output shows the columns, the data type of each column, the number of null values per column, and finally a sample of the data.

Time-series data: a forecasting problem
Labeled data: a classification problem

We can easily see that the last column of the data is Label. So for this dataset, the problem we need to solve falls into the second case.

The most basic way to process the data is to handle the null values in the columns. In this example, I simply fill the null fields with -1.

data_n_null = df.fillna(-1, inplace=False)
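
To make the effect of this step concrete, here is a small sketch on a made-up frame. The column names `rate` and `flag` are invented for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are invented for this illustration
toy = pd.DataFrame({"rate": [0.5, np.nan, 2.0], "flag": [1.0, 0.0, np.nan]})

# inplace=False (the default) returns a new frame and leaves `toy` untouched
filled = toy.fillna(-1, inplace=False)

print(toy.isnull().sum().sum())     # nulls remaining in the original frame
print(filled.isnull().sum().sum())  # nulls remaining in the filled copy
print(filled)
```

Using -1 as a sentinel only makes sense if -1 is not a legitimate value in any column, which is worth checking on the real data.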

Finally, we split the dataset into train and test sets.

from sklearn.model_selection import train_test_split

# Use only 30% of the dataset for a quick check of which model is suitable
data_sample = data_n_null.sample(frac=0.3)
X = data_sample.drop(columns=['Label'])
y = data_sample['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
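
One thing the split above does not control is the class balance of the two sets. If the labels are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The sketch below uses made-up labels; whether the real dataset needs this is an assumption to check against its label counts.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 80 of class "benign" and 20 of class "attack"
X_toy = [[i] for i in range(100)]
y_toy = ["benign"] * 80 + ["attack"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the 80/20 class ratio is preserved in both parts
print(Counter(y_tr))
print(Counter(y_te))
```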

Training with basic models

For classification problems, many models can perform well. With my limited knowledge, I only introduce a few basic models here.

We train the model, compute the F1 score, and record the features that contribute most to the model.

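import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()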
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 score: {f1}")
feature_scores = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
plt.figure(figsize=(20, 20))
sns.barplot(x=feature_scores, y=feature_scores.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
feature_importance_plot = "feature_importance.png"
plt.savefig(feature_importance_plot, bbox_inches='tight')
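
Since `average='weighted'` is used throughout, it may help to see what it computes: the F1 score of each class, averaged with weights equal to each class's share of the true labels. The following is a plain-Python sketch of that definition on made-up labels, written for illustration rather than as a replacement for `f1_score`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Made-up labels: 4 samples of class "a", 2 of class "b"
toy_true = ["a", "a", "a", "a", "b", "b"]
toy_pred = ["a", "a", "a", "b", "b", "a"]
print(weighted_f1(toy_true, toy_pred))
```

Here class "a" has F1 0.75 and class "b" has F1 0.5, so the weighted result is (0.75 * 4 + 0.5 * 2) / 6, roughly 0.667. A large class therefore dominates the score, which matters when the labels are imbalanced.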

Similarly with Random Forest

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 score: {f1}")
feature_scores = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
plt.figure(figsize=(20, 20))
sns.barplot(x=feature_scores, y=feature_scores.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
feature_importance_plot = "feature_importance.png"
plt.savefig(feature_importance_plot, bbox_inches='tight')

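Finally, XGBoost

from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

model = XGBClassifier()
# Encode the class labels as integers for XGBoost
label_encoder = LabelEncoder()
y_train_num = label_encoder.fit_transform(y_train)
# Reuse the mapping fitted on the training labels; do not refit on the test set
y_test_num = label_encoder.transform(y_test)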
model.fit(X_train, y_train_num)
y_pred = model.predict(X_test)
f1 = f1_score(y_test_num, y_pred, average='weighted')
print(f"F1 score: {f1}")
feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=['score']).sort_values(by = 'score', ascending=True)
data.nlargest(40, columns="score").plot(kind='barh', figsize = (20,20))
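
A label encoder should be fitted once and then reused, because fitting assigns integer codes from the sorted set of classes it sees. The sketch below mimics that behavior in plain Python (the helper `fit_codes` is hypothetical, and the class names are made up) to show how refitting on a test set that lacks a class silently shifts the codes.

```python
def fit_codes(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


# Made-up class names for illustration
train_labels = ["benign", "ddos", "scan", "benign"]
test_labels = ["scan", "benign"]  # happens to contain no "ddos" sample

train_codes = fit_codes(train_labels)  # fitted on the training labels
refit_codes = fit_codes(test_labels)   # refitted on the test labels (the pitfall)

print(train_codes)
print(refit_codes)
print([train_codes[label] for label in test_labels])  # consistent with training
print([refit_codes[label] for label in test_labels])  # "scan" gets a different code
```

With the refitted mapping, "scan" is encoded as 1 instead of 2, so predictions and test labels would no longer refer to the same classes.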

For each model in turn, we obtain the training time, the F1 score, and a plot showing how important each column (feature) of the dataset is to the model. The figure below shows the result for the Decision Tree model.

Using MLflow to track model metrics

MLflow is simple to use. It supports automatic logging for many model libraries, so a single call is enough to enable it.

import os

import mlflow

# Set MLflow tracking information
ML_TRACKING_URL = "http://localhost:5000"
mlflow.set_tracking_uri(ML_TRACKING_URL)
mlflow.set_experiment("decision_tree")
mlflow.sklearn.autolog()

# Use one explicit run so that the autologged values, the F1 metric and the
# figure are all recorded in the same run
with mlflow.start_run():
    # Start training the model
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    f1 = f1_score(y_test, y_pred, average='weighted')
    # Log the F1 metric to the run's metrics
    mlflow.log_metric("f1_score", f1)

    feature_scores = pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)
    plt.figure(figsize=(20, 20))
    sns.barplot(x=feature_scores, y=feature_scores.index)
    plt.xlabel('Feature Importance Score')
    plt.ylabel('Features')
    plt.title("Visualizing Important Features")
    feature_importance_plot = "feature_importance.png"
    plt.savefig(feature_importance_plot, bbox_inches='tight')

    # Save the figure as a run artifact and remove the local copy
    mlflow.log_artifact(feature_importance_plot)
    os.remove(feature_importance_plot)

Once execution finishes, the values are recorded in MLflow. Running the code several times records several runs. We can view more details by clicking on each experiment.

Using Optuna for hyperparameter tuning

There are many algorithms and libraries for this task. I tried sklearn's GridSearchCV and RandomizedSearchCV, but they were too heavy and ran slowly, so they did not suit my current setup.

Optuna uses studies and trials to define the hyperparameter search process. A study consists of many trials.

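import json
import os

import mlflow
import optuna
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

def xgboost_objective(trial):
    with mlflow.start_run(nested=True) as run:
        params = {
            "tree_method" : "hist",
            "device" : "cuda",  # requires a CUDA-capable GPU; use "cpu" otherwise
            # "objective" is left unset so that XGBClassifier uses its default classification objective
            "n_estimators": 1000,
            "verbosity": 0,
            "eval_metric" : ["rmse", "mae", "mape", "logloss","error","auc"],
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.1, log=True),
            "max_depth": trial.suggest_int("max_depth", 1, 10),
            "subsample": trial.suggest_float("subsample", 0.05, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.05, 1.0),
            "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
        }
        model = XGBClassifier(**params)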
        mlflow.xgboost.autolog()
        model.fit(X_train, y_train, verbose=False)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        precision = precision_score(y_test, y_pred, average='weighted')
        recall = recall_score(y_test, y_pred, average='weighted')
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("f1_score", f1)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)

        # Log a file containing a summary of the model
        metrics_file = "model_summary.json"
        metrics = {
            "parameter" : {**params},
            "metrics" : {
                "f1" : f1,
                "precision" : precision,
                "accuracy" : accuracy,
                "recall" : recall
            }

        }

        with open(metrics_file, "w") as f:
            json.dump(metrics, f, indent=4)

        mlflow.log_artifact(metrics_file)
        os.remove(metrics_file)

        # Save run_id of run to trial
        trial.set_user_attr("run_id", run.info.run_id)
    return f1

# Callback invoked after each trial; it records the best value found so far
def champion_callback(study, frozen_trial):
    winner = study.user_attrs.get("winner", None)
    if study.best_value and winner != study.best_value:
        study.set_user_attr("winner", study.best_value)
        if winner:
            improvement_percent = (abs(winner - study.best_value) / study.best_value) * 100
            print(
                f"Trial {frozen_trial.number} achieved value: {frozen_trial.value} with "
                f"{improvement_percent: .4f}% improvement"
            )
        else:
            print(f"Initial trial {frozen_trial.number} achieved value: {frozen_trial.value}")

tags = {
    "dataset_frac": 1.0,
    "random_state": 42,
    "test_size" : 0.2,
    # Based on the feature importance scores, drop some columns to reduce dimensionality; this also reduces the size of the model
    "droped_column" : ['ID','IPv','Drate','Telnet','SMTP','ARP','cwr_flag_number','ece_flag_number','fin_flag_number','SSH','psh_flag_number','rst_flag_number'],
    "author": "Son Nguyen"
}

data = data_n_null.drop(columns=tags['droped_column'])
data_sample = data.sample(frac=tags['dataset_frac'])
X = data_sample.drop(columns=['Label'])
y = data_sample['Label']
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = tags['test_size'], random_state = tags['random_state'])

mlflow.set_experiment("xgboost")
optuna.logging.set_verbosity(optuna.logging.ERROR)
with mlflow.start_run() as run:

    # mlflow.xgboost.autolog() cannot be placed here; it is called inside each trial instead
    study = optuna.create_study(direction='maximize')
    study.optimize(xgboost_objective, n_trials=100, timeout=7200, callbacks=[champion_callback], show_progress_bar=True)

    best_trial = study.best_trial
    best_run_id = best_trial.user_attrs['run_id']
    # best_param = study.best_params
    best_value = study.best_value

    model_name = "XGBoost-Classifier"
    client = mlflow.tracking.MlflowClient()
    # Newest registered version of the model, or None if nothing is registered yet
    versions = client.search_model_versions(f"name='{model_name}'")
    latest_ = max(versions, key=lambda v: int(v.version)) if versions else None

    best_param = client.get_run(best_run_id).data.params

    # Check whether the best trial of this study beats the existing model. If it does, or if no model
    # exists yet, register this trial as a new version of the model. Otherwise skip it.
    previous_f1_score = None
    if latest_:
        previous_f1_score = client.get_metric_history(latest_.run_id, "f1_score")[-1].value

    if previous_f1_score is not None and previous_f1_score >= best_value:
        print(f"Last model is better. Current value {best_value}, latest value {previous_f1_score}")
    else:
        model_uri = f"runs:/{best_run_id}/model"
        best_model = mlflow.register_model(model_uri, model_name)

        client.update_registered_model(
            name=model_name,
            description="Best model",
        )

        for key, value in best_param.items():
            client.set_model_version_tag(
                name=model_name,
                version=best_model.version,
                key=key,
                value=value
            )

        client.set_model_version_tag(
            name=model_name,
            version=best_model.version,
            key="values",
            value=best_value
        )

        for key, value in tags.items():
            mlflow.set_tag(key, value)
        mlflow.set_tag("job", "xgboost using optuna to search parameter")
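
The decision at the end of the listing above can be isolated into a small pure function, which makes the rule easy to test without a tracking server. This is only a sketch: the name `should_promote` is hypothetical and not part of MLflow or Optuna, and promoting when no earlier model exists is my assumption about the intended behavior.

```python
from typing import Optional


def should_promote(candidate_f1: float, previous_f1: Optional[float]) -> bool:
    """Promote when there is no earlier model or the candidate is strictly better."""
    if previous_f1 is None:
        # Assumption: the first model ever trained is always registered
        return True
    return candidate_f1 > previous_f1


print(should_promote(0.91, None))  # no registered model yet
print(should_promote(0.91, 0.88))  # candidate beats the previous version
print(should_promote(0.88, 0.88))  # a tie keeps the previous version
```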

In short, we use Optuna to suggest parameter values within ranges that we define. It runs at most a given number of trials within a given time limit to find the parameters that give the highest F1 score.
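
To illustrate the loop that a study runs, here is a plain-Python stand-in that uses simple random search rather than Optuna's own sampler. It is only a sketch: `toy_objective` is a made-up function used in place of training a model, and `run_study` is a hypothetical helper, not an Optuna API. It shows the two stopping conditions (trial count and time budget) and log-scale sampling for the learning rate.

```python
import math
import random
import time


def toy_objective(learning_rate: float, max_depth: int) -> float:
    """Made-up score that peaks at learning_rate=0.01 and max_depth=6."""
    return 1.0 - abs(math.log10(learning_rate) + 2) / 4 - abs(max_depth - 6) / 20


def run_study(n_trials: int, timeout: float, seed: int = 0):
    rng = random.Random(seed)
    start = time.monotonic()
    best_score, best_params = -math.inf, None
    for _ in range(n_trials):
        if time.monotonic() - start > timeout:
            break  # time budget exhausted before all trials ran
        params = {
            # Log-uniform in [1e-3, 1e-1]: sample the exponent uniformly
            "learning_rate": 10 ** rng.uniform(-3, -1),
            "max_depth": rng.randint(1, 10),
        }
        score = toy_objective(**params)
        if score > best_score:  # direction is "maximize"
            best_score, best_params = score, params
    return best_score, best_params


best_score, best_params = run_study(n_trials=100, timeout=5.0)
print(best_score, best_params)
```

Sampling the exponent uniformly gives small learning rates as much attention as large ones, which is the same idea as passing `log=True` to `trial.suggest_float`.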

After a study with many trials has completed, we have many measurements from the runs. We compare how they relate to each other in order to adjust the parameter ranges.

The figure above shows how each parameter relates to whether the model's F1 score is high or low. From that we decide whether to widen, narrow, or shift the parameter ranges that we give Optuna to search.
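
One simple way to turn that observation into numbers is to look at where the best trials cluster. The sketch below takes made-up trial records, keeps the top quarter by F1, and reports the span of each parameter among them as a candidate narrower range. The data and the helper `suggest_ranges` are invented for illustration; with a real study, the records could be built from its trials.

```python
def suggest_ranges(trials, top_fraction=0.25):
    """Return {param: (min, max)} over the best-scoring fraction of trials."""
    ranked = sorted(trials, key=lambda t: t["f1"], reverse=True)
    keep = ranked[: max(1, int(len(ranked) * top_fraction))]
    names = keep[0]["params"].keys()
    return {
        name: (min(t["params"][name] for t in keep),
               max(t["params"][name] for t in keep))
        for name in names
    }


# Made-up trial results for illustration only
toy_trials = [
    {"f1": 0.91, "params": {"learning_rate": 0.05, "max_depth": 8}},
    {"f1": 0.90, "params": {"learning_rate": 0.08, "max_depth": 9}},
    {"f1": 0.84, "params": {"learning_rate": 0.01, "max_depth": 5}},
    {"f1": 0.80, "params": {"learning_rate": 0.005, "max_depth": 3}},
    {"f1": 0.79, "params": {"learning_rate": 0.002, "max_depth": 2}},
    {"f1": 0.75, "params": {"learning_rate": 0.001, "max_depth": 1}},
    {"f1": 0.74, "params": {"learning_rate": 0.003, "max_depth": 4}},
    {"f1": 0.70, "params": {"learning_rate": 0.002, "max_depth": 1}},
]
print(suggest_ranges(toy_trials))
```

In this toy data the best trials sit near the upper end of both ranges, which would suggest shifting or widening the ranges upward rather than only narrowing them.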

Info

I will try to update this post as I learn more.