Deploying a Machine Learning Model as a Service

SuKai August 8, 2021

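Once a model is trained, how do you expose it as an inference service? Seldon Core is a popular component for deploying machine learning models on Kubernetes. Put simply, Seldon Core wraps a model as a production-grade REST/gRPC microservice. It integrates with Istio, Jaeger, and Prometheus, and supports canary releases, A/B testing, distributed tracing, and metrics monitoring.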

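Today's example is the most minimal setup: Seldon Core on its own, with no other open-source components. All I want from Seldon Core is to load my model and expose an API for it.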

| Deploy the Seldon Core Operator

Edit values.yaml to disable ambassador and istio:


ambassador:
  enabled: false
istio:
  enabled: false

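My Kubernetes cluster runs v1.22.2, so the API version in webhook.yaml has to be changed. Kubernetes 1.22 no longer serves admissionregistration.k8s.io/v1beta1, and the v1 API requires these fields on each webhook: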

  sideEffects: None
  admissionReviewVersions:
  - v1beta1

| Install

helm install -n seldon-system seldon-core-operator seldon-core-operator

| Prefect workflow

Grant the Prefect agent role permission to operate on the Seldon API:

- apiGroups:
  - machinelearning.seldon.io
  resources:
  - seldondeployments
  verbs:
  - '*'
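
The wildcard verb grants more than the deploy task needs. As a rough sketch, the rule could be narrowed to the verbs behind the three Kubernetes client calls that the deploy_model task makes later in this post. The helper name `seldon_rule` and the mapping table are illustrative only; they are not part of Prefect or Seldon.

```python
import json

# Kubernetes RBAC verbs behind the CustomObjectsApi calls used by deploy_model.
# "replace" in the Python client is an HTTP PUT, which RBAC calls "update".
CALL_TO_VERB = {
    "create_namespaced_custom_object": "create",
    "get_namespaced_custom_object": "get",
    "replace_namespaced_custom_object": "update",
}


def seldon_rule(calls):
    """Build a narrower RBAC rule for the given client calls (hypothetical helper)."""
    verbs = sorted({CALL_TO_VERB[call] for call in calls})
    return {
        "apiGroups": ["machinelearning.seldon.io"],
        "resources": ["seldondeployments"],
        "verbs": verbs,
    }


rule = seldon_rule(CALL_TO_VERB)
print(json.dumps(rule, indent=2))
```

If the flow later needs to delete or watch deployments, the matching verbs would have to be added to the table.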

| Modify the model training task

Have the training task return the MLflow run_id. When the model files are logged, do not register a model version.

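import mlflow
import mlflow.sklearn
from prefect import task
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

@task
def train_model(data, mlflow_experiment_id, alpha=0.5, l1_ratio=0.5):
    mlflow.set_tracking_uri('http://mlflow.platform.sukai.com/')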

    train, test = train_test_split(data)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    with mlflow.start_run(experiment_id=mlflow_experiment_id):
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)
        predicted_qualities = lr.predict(test_x)
        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        #mlflow.sklearn.log_model(lr, "model",registered_model_name="ElasticnetWineModel")
        mlflow.sklearn.log_model(lr, "model")

        run_id = mlflow.active_run().info.run_id
        return run_id
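
The task calls eval_metrics, which the post does not show. A minimal sketch follows, assuming it returns (rmse, mae, r2) in that order, as the unpacking in train_model suggests. It is written without dependencies so it can be read in isolation; in practice `mean_squared_error`, `mean_absolute_error`, and `r2_score` from sklearn.metrics would do the same job. This version expects flat sequences of numbers, so a one-column DataFrame such as test_y would need to be passed as test_y["quality"].

```python
import math


def eval_metrics(actual, pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    actual = [float(a) for a in actual]
    pred = [float(p) for p in pred]
    if len(actual) != len(pred) or not actual:
        raise ValueError("actual and pred must be non-empty and the same length")

    n = len(actual)
    errors = [a - p for a, p in zip(actual, pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R^2 is undefined when every actual value is identical; report 0.0 in that case.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return rmse, mae, r2


rmse, mae, r2 = eval_metrics([5, 6, 7], [5, 6, 7])
print(rmse, mae, r2)
```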

| Add a model registration task

The task returns the location the model can be downloaded from.

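import time

from mlflow.entities.model_registry.model_version_status import ModelVersionStatus
from mlflow.tracking import MlflowClient

# Wait until the model version is ready (stops polling after 10 attempts)
def wait_until_ready(model_name, model_version):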
    client = MlflowClient()
    for _ in range(10):
        model_version_details = client.get_model_version(
            name=model_name,
            version=model_version,
        )
        status = ModelVersionStatus.from_string(model_version_details.status)
        print("Model status: %s" % ModelVersionStatus.to_string(status))
        if status == ModelVersionStatus.READY:
            break
        time.sleep(3)

@task
def register_model(run_id: str, model_name: str, stage: str = "staging"):
    client = MlflowClient()
    artifact_path = "model"
    model_uri = "runs:/{run_id}/{artifact_path}".format(run_id=run_id, artifact_path=artifact_path)
    model_details = mlflow.register_model(model_uri=model_uri, name=model_name)

    wait_until_ready(model_details.name, model_details.version)

    client.transition_model_version_stage(
        name=model_details.name,
        version=model_details.version,
        stage=stage,
    )
    
    return model_details.source
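
One thing to note about wait_until_ready: the loop ends quietly after 10 attempts even if the model version never reaches READY, and register_model then carries on. A possible variant, sketched here with a hypothetical `poll_until` helper, raises instead so the flow fails early. The `check` callable stands in for the MLflow registry lookup.

```python
import time


def poll_until(check, attempts=10, delay=3.0, sleep=time.sleep):
    """Call check() until it returns True; raise TimeoutError if it never does."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"condition not met after {attempts} attempts")


# Stand-in for the registry lookup: reports READY on the third call.
statuses = iter(["PENDING_REGISTRATION", "PENDING_REGISTRATION", "READY"])
tries = poll_until(lambda: next(statuses) == "READY", sleep=lambda _: None)
print(f"ready after {tries} attempts")
```

The `sleep` parameter exists only so the example and its tests run without waiting; real use would leave it at the default.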

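| Add a deploy model task

Point modelUri at the trained model and create a SeldonDeployment custom resource (CR) in Kubernetes. By default Seldon runs the container as UID 8888, and startup failed because a directory was not writable, so the container is run as root (runAsUser: 0) here.

After the container starts, installing the dependency packages takes a long time, so the container probes are given an initial delay of 120 seconds.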
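
As a rough guide to how much startup time these probe settings allow, the values used in the manifest below (initialDelaySeconds 120, periodSeconds 5, failureThreshold 100) can be combined as follows. This is an approximation that ignores probe timeouts, and the function name is made up for illustration.

```python
def startup_budget_seconds(initial_delay, period, failure_threshold):
    """Approximate time before consecutive liveness failures trigger a restart."""
    return initial_delay + period * failure_threshold


budget = startup_budget_seconds(initial_delay=120, period=5, failure_threshold=100)
print(f"about {budget} seconds ({budget / 60:.1f} minutes)")
```

With these numbers the container gets roughly 620 seconds, a little over ten minutes, to finish installing dependencies before Kubernetes restarts it.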


seldon_deployment = """
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: wines-classifier
      namespace: ai
    spec:
      name: wines-classifier
      predictors:
      - graph:
          implementation: MLFLOW_SERVER
          modelUri: dummy
          envSecretRefName: seldon-init-container-secret
          name: classifier
        name: default
        replicas: 1
        componentSpecs:
        - spec:
            # We are setting high failureThreshold as installing conda dependencies
            # can take long time and we want to avoid k8s killing the container prematurely
            containers:
            - name: classifier
              image: seldonio/mlflowserver:1.11.2-dev
              securityContext:
                runAsUser: 0
              livenessProbe:
                initialDelaySeconds: 120
                failureThreshold: 100
                periodSeconds: 5
                successThreshold: 1
                httpGet:
                  path: /health/ping
                  port: http
                  scheme: HTTP
              readinessProbe:
                initialDelaySeconds: 120
                failureThreshold: 100
                periodSeconds: 5
                successThreshold: 1
                httpGet:
                  path: /health/ping
                  port: http
                  scheme: HTTP
"""

CUSTOM_RESOURCE_INFO = dict(
    group="machinelearning.seldon.io",
    version="v1",
    plural="seldondeployments",
)

import prefect
import yaml
from kubernetes import client, config
from kubernetes.client.rest import ApiException

@task
def deploy_model(model_uri: str, namespace: str = "seldon"):
    logger = prefect.context.get("logger")

    logger.info(f"Deploying model {model_uri} to environment {namespace}")

    config.load_incluster_config()
    custom_api = client.CustomObjectsApi()

    dep = yaml.safe_load(seldon_deployment)
    # Keep the manifest namespace in line with the namespace of the API call
    dep["metadata"]["namespace"] = namespace
    dep["spec"]["predictors"][0]["graph"]["modelUri"] = model_uri

    try:
        resp = custom_api.create_namespaced_custom_object(
            **CUSTOM_RESOURCE_INFO,
            namespace=namespace,
            body=dep,
        )
        # The operator fills in status asynchronously, so it may be absent here
        state = resp.get("status", {}).get("state", "unknown")
        logger.info("Deployment created. status='%s'" % state)
    except ApiException as exc:
        # 409 Conflict means the deployment already exists; re-raise anything else
        if exc.status != 409:
            raise
        logger.info("Updating existing model")
        existing_deployment = custom_api.get_namespaced_custom_object(
            **CUSTOM_RESOURCE_INFO,
            namespace=namespace,
            name=dep["metadata"]["name"],
        )
        existing_deployment["spec"]["predictors"][0]["graph"]["modelUri"] = model_uri

        resp = custom_api.replace_namespaced_custom_object(
            **CUSTOM_RESOURCE_INFO,
            namespace=namespace,
            name=existing_deployment["metadata"]["name"],
            body=existing_deployment,
        )
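
deploy_model sets modelUri in two places, once on the freshly parsed manifest and once on the existing deployment. One way to keep that step testable without a cluster is a small pure helper that returns a patched copy. This is only a sketch: `with_model_uri` is a hypothetical name, the template is trimmed to the fields the helper touches, and the S3 path is a made-up placeholder.

```python
import copy


def with_model_uri(manifest, model_uri, namespace=None):
    """Return a copy of a SeldonDeployment manifest pointing at model_uri."""
    updated = copy.deepcopy(manifest)
    updated["spec"]["predictors"][0]["graph"]["modelUri"] = model_uri
    if namespace is not None:
        updated["metadata"]["namespace"] = namespace
    return updated


template = {
    "metadata": {"name": "wines-classifier", "namespace": "ai"},
    "spec": {"predictors": [{"graph": {"modelUri": "dummy"}}]},
}
# The S3 path is a placeholder, not the real artifact location.
patched = with_model_uri(template, "s3://example-bucket/model")
print(patched["spec"]["predictors"][0]["graph"]["modelUri"])
```

Because the helper copies before editing, the template can be reused across runs without carrying over an earlier modelUri.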

| Add the deploy_model task to the workflow

    with Flow("train-wine-quality-model", schedule, storage=storage, result=result, run_config=run_config) as flow:
        alpha = Parameter('alpha', default=0.3)
        l1_ratio = Parameter('l1_ratio', default=0.3)
        data = fetch_data()
        run_id = train_model(data=data, mlflow_experiment_id=4, alpha=alpha, l1_ratio=l1_ratio)
        source = register_model(run_id=run_id, model_name="ElasticnetWineModel", stage="staging")
        deploy_model(model_uri=source, namespace="ai")

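When the Seldon model server container starts, it creates a conda environment. This is slow, and package installation often fails. So I use a rebuilt image that already has the conda environment installed, and modify the script so it no longer creates one.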

Dockerfile-seldon-mlflowserver

FROM seldonio/mlflowserver:1.11.2
COPY conda.yaml /tmp/conda.yaml
COPY conda_env_create.py /microservice/conda_env_create.py
RUN conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/ && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ && pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
RUN chmod 777 /microservice && mkdir /.cache && chown 8888:8888 /.cache 
RUN conda env create -n mlflow --file /tmp/conda.yaml

The modified conda_env_create.py skips creating the conda env, but the model's dependency packages are still installed:

def create_env(env_file_path):
    """Creates Conda environment from YAML.

    Creates a Conda environment from a YAML file describing Python version,
    dependencies, etc.
    The new environment name is read from the `CONDA_ENV_NAME` environment
    variable.
    If the variable is not defined, it falls back to `mlflow`.
    """
    env_file_name = os.path.basename(env_file_path)
    env_name = os.getenv("CONDA_ENV_NAME", DEFAULT_CONDA_ENV_NAME)
    env_name = quote(env_name)
    env_file_path = quote(env_file_path)

    log.info(f"Creating Conda environment '{env_name}' from {env_file_name}")

    cmd = f"conda env create -n {env_name} --file {env_file_path}"
    #run(cmd, shell=True, check=True)

| Run the pipeline

(base) jovyan@jupyter-0:~/ai-demo/cicd$ python wine-quality-pipeline.py 
[2021-11-20 14:37:34+0000] INFO - prefect.S3 | Uploading train-wine-quality-model/2021-11-20t14-37-34-655792-00-00 to prefect-sukai
Flow URL: http://localhost:8080/default/flow/1b3c08e5-3405-432d-8f68-d895769d7ea4
 └── ID: eba90e34-6095-4aa5-b49d-5ccf263637c3
 └── Project: wine-quality-project
 └── Labels: []

| View the pipeline

[Image: image-20211121160659888]

[Image: image-20211121160809141]

| Check Kubernetes

sukai@sukai:~$ kubectl -n ai get SeldonDeployment
NAME               AGE
wines-classifier   17h
sukai@sukai:~$ kubectl -n ai get pods
NAME                                                    READY   STATUS      RESTARTS       AGE
codeserver-0                                            1/1     Running     0              7d21h
sukai-ss-0-0                                             1/1     Running     0              9d
jupyter-0                                               1/1     Running     0              46h
minio-console-68f6f6466f-g5klf                          1/1     Running     0              10d
minio-operator-75f99f579-mpjng                          1/1     Running     0              10d
mlflow-0                                                1/1     Running     0              9d
optuna-dashboard-848fbbc75b-rvw5k                       1/1     Running     0              6d1h
prefect-agent-6455b6897d-h8jst                          1/1     Running     5 (4d4h ago)   4d5h
prefect-apollo-679b57c674-jcjz9                         1/1     Running     2 (4d5h ago)   4d5h
prefect-create-tenant-job--1-925pb                      0/1     Completed   6              4d5h
prefect-graphql-6db5c5d5c-42t2j                         1/1     Running     0              4d5h
prefect-hasura-f74f8b6c8-zh9hb                          1/1     Running     4 (4d5h ago)   4d5h
prefect-postgresql-0                                    1/1     Running     0              4d5h
prefect-towel-5f4bfc58c7-nm8lj                          1/1     Running     0              4d5h
prefect-ui-795c9cb7b9-qkm7p                             1/1     Running     0              4d5h
wines-classifier-default-0-classifier-d5b687dcf-v8nms   2/2     Running     0              17h
sukai@sukai:~$ kubectl -n ai get svc
NAME                                  TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                      AGE
codeserver                            ClusterIP      None             <none>          8443/TCP                     7d22h
console                               ClusterIP      10.211.92.95     <none>          9090/TCP,9443/TCP            10d
sukai-console                          ClusterIP      10.211.251.129   <none>          9090/TCP                     10d
sukai-hl                               ClusterIP      None             <none>          9000/TCP                     10d
jupyter                               ClusterIP      10.211.23.122    <none>          8888/TCP,7777/TCP,2222/TCP   4d23h
jupyter-headless                      ClusterIP      None             <none>          8888/TCP,7777/TCP,2222/TCP   4d23h
minio                                 ClusterIP      10.211.148.215   <none>          80/TCP                       10d
mlflow                                ClusterIP      None             <none>          5000/TCP                     9d
operator                              ClusterIP      10.211.198.175   <none>          4222/TCP                     10d
optuna-dashboard                      ClusterIP      10.211.126.243   <none>          80/TCP                       6d1h
prefect-apollo                        LoadBalancer   10.211.18.199    <pending>       4200:31555/TCP               4d5h
prefect-graphql                       ClusterIP      10.211.78.161    <none>          4201/TCP                     4d5h
prefect-hasura                        ClusterIP      10.211.75.152    <none>          3000/TCP                     4d5h
prefect-postgresql                    ClusterIP      10.211.175.82    <none>          5432/TCP                     4d5h
prefect-postgresql-headless           ClusterIP      None             <none>          5432/TCP                     4d5h
prefect-ui                            LoadBalancer   10.211.152.173   192.168.0.119   8080:30889/TCP               4d5h
wines-classifier-default              ClusterIP      10.211.74.144    <none>          8000/TCP,5001/TCP            17h
wines-classifier-default-classifier   ClusterIP      10.211.62.109    <none>          9000/TCP,9500/TCP            17h
sukai@sukai:~$
sukai@sukai:~$ kubectl -n ai get ingress
NAME                 CLASS    HOSTS                          ADDRESS   PORTS   AGE
codeserver           <none>   codeserver.platform.sukai.com             80      7d22h
sukai-console         <none>   sukai-minio.platform.sukai.com             80      9d
sukai-minio           <none>   s3.platform.sukai.com                     80      9d
jupyter              <none>   jupyter.platform.sukai.com                80      4d23h
minio-console        <none>   minio.platform.sukai.com                  80      10d
mlflow               <none>   mlflow.platform.sukai.com                 80      9d
optuna-dashboard     <none>   optuna.platform.sukai.com                 80      6d1h
prefect-apollo       <none>   apollo.platform.sukai.com                 80      4d5h
prefect-ui           <none>   prefect.platform.sukai.com                80      4d5h
wine-quality-model   <none>   wine.platform.sukai.com                   80      4h51m
sukai@sukai:~$

| Create an Ingress for the model service

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: wine-quality-model
  namespace: ai
  annotations:
    kubernetes.io/ingress.class: traefik
spec:
  rules:
  - host: wine.platform.sukai.com
    http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            name: wines-classifier-default-classifier
            port:
              name: http

| Call the service API

[Image: image-20211121161526330]
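
For readers who want to try the endpoint from code, here is a minimal sketch of building a prediction request with the standard library. It assumes the model server accepts the Seldon protocol at /api/v1.0/predictions behind the Ingress host defined above, and that the model was trained on the eleven feature columns of the public wine-quality dataset. The column list and the sample row are assumptions for illustration, not values taken from this post.

```python
import json
from urllib import request

# Column order assumed to match the wine-quality CSV used for training.
FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]


def build_request(host, rows):
    """Build a Seldon-protocol prediction request without sending it."""
    payload = {"data": {"names": FEATURES, "ndarray": rows}}
    return request.Request(
        url=f"http://{host}/api/v1.0/predictions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative feature values, not taken from the post.
sample_row = [7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8]
req = build_request("wine.platform.sukai.com", [sample_row])
print(req.get_method(), req.full_url)
# To actually send it: body = json.load(request.urlopen(req))
```

The request is built but not sent, so the snippet runs without network access; uncommenting the last line would perform the call.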