Standalone Deployment for GIE

We have demonstrated how to execute interactive queries easily by installing GraphScope via pip on a local machine. However, in real-life applications, graphs are often too large to fit on a single machine. In such cases, GraphScope can be deployed on a cluster, such as a self-managed k8s cluster, for processing large-scale graphs. But you may wonder, “what if I only need the GIE engine and not the whole package of GraphScope?” This tutorial will walk you through the process of standalone deployment of GIE on a self-managed k8s cluster.

Throughout the tutorial, we assume all machines are running Linux. We do not guarantee that it works as smoothly on other platforms. For your reference, we’ve tested the tutorial on Ubuntu 20.04.

Prerequisites

  • Kubernetes Cluster

  • Python >= 3.9

  • JDK 11 (Both JDK 8 and 20 have known compatibility issues)

To get started, you need a Kubernetes cluster.

In case you don’t have one, refer to the instructions to create kubernetes cluster.

Deploy Your First GIE Service

The easiest way to deploy GIE standalone is by using Helm, which is a package manager for K8s that simplifies the deployment and management of applications. To deploy GIE standalone using Helm, you can follow these steps:

  • Install Helm on your local machine if you do not have it by following the instructions on the official Helm website.

  • Pull the GIE Helm chart to your local disk (this requires the graphscope chart repository to be added to Helm first):

    helm pull graphscope/gie-standalone --untar
    
  • Prepare the etcd pod:

    kubectl apply -f gie-standalone/tools/etcd.yaml
    
  • Prepare the graph data:

    cp -r gie-standalone/data /tmp/
    

    Check whether the raw data is there:

    tree /tmp/data
    

    You should be able to see the raw data of the modern graph.

    /tmp/data
    └── modern_graph
       ├── created.csv
       ├── knows.csv
       ├── person.csv
       └── software.csv
    

    Then create the K8s persistent volume (PV) and persistent volume claim (PVC):

    kubectl apply -f gie-standalone/tools/pvc.yaml
    

    The modern graph raw data in /tmp/data will be automatically loaded into the GIE graph store (by default on Vineyard).

    Tip

    You can load the data from any /path/to/your/data. All you need to do is copy the raw data to /path/to/your/data and modify the hostPath.path in gie-standalone/tools/pvc.yaml to /path/to/your/data.

  • Install the GIE chart:

    helm install [YOUR_RELEASE_NAME] gie-standalone
    
  • Verify that the GIE service is running:

    kubectl get pods
    

    You should see the [YOUR_RELEASE_NAME]-gie-standalone-frontend-0 and [YOUR_RELEASE_NAME]-gie-standalone-store-0 pods running.

  • Get the endpoint of the GIE Frontend service:

    1. Get <ip>:<gremlinPort> for Gremlin queries:

      kubectl describe svc [YOUR_RELEASE_NAME]-gie-standalone-frontend \
      | grep "Endpoints:" | awk -F' ' '{print $2}' | head -1
      
    2. Get <ip>:<cypherPort> for Cypher queries:

      kubectl describe svc [YOUR_RELEASE_NAME]-gie-standalone-frontend \
      | grep "Endpoints:" | awk -F' ' '{print $2}' | tail -1
      
  • Connect to the GIE frontend service in either of the following two ways:

    1. Using TinkerPop’s official SDKs or the Gremlin console, which can be found here.

    2. Using Neo4j’s official SDKs or Cypher-Shell, which can be found here.
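The endpoint lookup and connection steps above can also be scripted. The following is a minimal Python sketch, not part of the GIE chart: it roughly mirrors the grep "Endpoints:" pipeline by parsing the text printed by kubectl describe svc, and derives connection URLs from the result. The helper names (parse_endpoints, connection_urls), the sample output, and the URL forms ws://<ip>:<port>/gremlin and bolt://<ip>:<port> are assumptions for illustration; check them against the SDK you actually use.

```python
from typing import Dict, List


def parse_endpoints(describe_output: str) -> List[str]:
    """Return the first address of every "Endpoints:" line, in order.

    Roughly mirrors: grep "Endpoints:" | awk '{print $2}'
    """
    endpoints = []
    for line in describe_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "Endpoints:":
            # A line may list several comma-separated addresses; keep the first.
            endpoints.append(fields[1].split(",")[0])
    return endpoints


def connection_urls(endpoints: List[str]) -> Dict[str, str]:
    """Use the first endpoint for Gremlin and the last for Cypher (head -1 / tail -1)."""
    if not endpoints:
        raise ValueError("no 'Endpoints:' line found; is the service running?")
    return {
        "gremlin": f"ws://{endpoints[0]}/gremlin",  # assumed Gremlin URL form
        "cypher": f"bolt://{endpoints[-1]}",        # assumed Bolt URI form
    }


if __name__ == "__main__":
    # Hypothetical excerpt of `kubectl describe svc <release>-gie-standalone-frontend`.
    sample = """\
Name:              demo-gie-standalone-frontend
Port:              gremlin  8182/TCP
Endpoints:         10.244.0.12:8182
Port:              cypher  7687/TCP
Endpoints:         10.244.0.12:7687
"""
    print(connection_urls(parse_endpoints(sample)))
```

In practice, the input text could come from subprocess.run(["kubectl", "describe", "svc", name], capture_output=True, text=True).stdout, and the resulting URLs would be passed to the Gremlin or Neo4j client of your choice.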

Remove the GIE Service

   helm uninstall [YOUR_RELEASE_NAME]

Using Your Own Data

Currently, a single instance of GIE can only handle one set of graph data. This means that you must indicate which raw data should be uploaded into GIE’s graph store, and all subsequent queries made through the GIE instance will pertain to the uploaded graph.

The above tutorial uses the modern graph to demonstrate the launch procedure. However, it is easy to use your own data: you only need to provide a short specification that describes it.

Let’s look into the specification of the modern graph in gie-standalone/config/v6d_modern_loader.json:

{
    "vertices": [
        {
            "data_path": "$STORE_DATA_PATH/modern_graph/person.csv",
            "label": "person",
            "options": "header_row=true&delimiter=|"
        },
        {
            "data_path": "$STORE_DATA_PATH/modern_graph/software.csv",
            "label": "software",
            "options": "header_row=true&delimiter=|"
        }
    ],
    "edges": [
        {
            "data_path": "$STORE_DATA_PATH/modern_graph/knows.csv",
            "label": "knows",
            "src_label": "person",
            "dst_label": "person",
            "options": "header_row=true&delimiter=|"
        },
        {
            "data_path": "$STORE_DATA_PATH/modern_graph/created.csv",
            "label": "created",
            "src_label": "person",
            "dst_label": "software",
            "options": "header_row=true&delimiter=|"
        }
    ],
    "directed": 1,
    "retain_oid": 1,
    "generate_eid": 1
}

There are a few things to note:

  • For now, we support loading raw data from CSV-like files.

  • Prepare an individual file for each type of vertex and edge. For example, in the modern graph, the data of the “person” vertex is in the file modern_graph/person.csv.

  • Place the raw data in the hostPath.path specified above.

  • For each type of vertex, configure

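    • data_path: the location of the data file. In the example above it is given under $STORE_DATA_PATH, which corresponds to hostPath.path (the default value is /tmp/data).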

    • label: the label of the vertex. For example, “person”, “software”.

    • options: configure as “key1=value1&key2=value2&…”. Details can be found in this guide, while we provide some useful keys here:

      • header_row: whether the file contains a header row. The default value is false.

      • delimiter: the token that separates the data fields in a row. The default value is ','.

      • column_types: the data types of all data fields, in the order in which they appear in a row. If not specified, as in the modern graph example, the store will attempt to infer the data types from the raw data. You can also specify them according to your needs. For example, if there are two data fields, “field1” and “field2”, you can specify column_types=string,int64_t to indicate their types.

  • For each type of edge, configure

    • data_path, label, options: similar to those of vertices. To avoid unexpected problems, make the first two data fields the ids of the source and destination vertices, and if column_types is given, set the types of these two fields to int64_t.

    • src_label: the label of the source vertex of this edge.

    • dst_label: the label of the destination vertex of this edge.

Tip

For your reference, we have provided a sample for loading LDBC data in gie-standalone/config/v6d_ldbc_loader.json.
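To make the structure of the specification concrete, here is a minimal Python sketch that assembles a loader specification with the same fields as the modern graph example and prints it as JSON. The helper names (make_options, vertex, edge, build_spec), the my_graph directory, and the column types of the sample edge are hypothetical. The check that every src_label and dst_label refers to a declared vertex label is a sanity check added for this sketch, not something the loader is documented to enforce.

```python
import json
from typing import Dict, List, Optional


def make_options(header_row: bool = True, delimiter: str = "|",
                 column_types: Optional[List[str]] = None) -> str:
    """Encode loader options as "key1=value1&key2=value2"."""
    parts = [f"header_row={'true' if header_row else 'false'}",
             f"delimiter={delimiter}"]
    if column_types:
        parts.append("column_types=" + ",".join(column_types))
    return "&".join(parts)


def vertex(label: str, path: str, **opts) -> Dict[str, str]:
    """Describe one vertex type and the file that holds its data."""
    return {"data_path": path, "label": label, "options": make_options(**opts)}


def edge(label: str, src_label: str, dst_label: str, path: str, **opts) -> Dict[str, str]:
    """Describe one edge type between two vertex labels."""
    return {"data_path": path, "label": label, "src_label": src_label,
            "dst_label": dst_label, "options": make_options(**opts)}


def build_spec(vertices: List[dict], edges: List[dict]) -> dict:
    """Combine vertex and edge entries, checking that edge labels are declared."""
    known = {v["label"] for v in vertices}
    for e in edges:
        for key in ("src_label", "dst_label"):
            if e[key] not in known:
                raise ValueError(f"edge {e['label']!r}: unknown {key} {e[key]!r}")
    # Top-level flags copied from the modern graph example.
    return {"vertices": vertices, "edges": edges,
            "directed": 1, "retain_oid": 1, "generate_eid": 1}


if __name__ == "__main__":
    base = "$STORE_DATA_PATH/my_graph"  # hypothetical data directory
    spec = build_spec(
        vertices=[vertex("person", f"{base}/person.csv")],
        edges=[edge("knows", "person", "person", f"{base}/knows.csv",
                    column_types=["int64_t", "int64_t", "string"])],
    )
    print(json.dumps(spec, indent=4))
```

Generating the file with json.dumps also guarantees strictly valid JSON, which avoids small syntax slips when the specification is edited by hand.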

Deploy on a Cluster

In K8s, it’s convenient to deploy GIE on a cluster with multiple machines. You don’t need to be aware of the physical machines; you simply configure the number of executors to scale GIE. K8s will seamlessly assign these GIE executors to the physical machines.

Set the number of executors as follows:

helm install [YOUR_RELEASE_NAME] graphscope/gie-standalone --set executor.replicaCount=3

This command deploys the GIE chart with 3 executors that process graph partitions in v6d. You can adjust the number of replicas to your needs, but it should preferably be less than the number of CPUs in your cluster. Given the number of executors, v6d loads the data from the specified location and automatically partitions the graph among the executors. For convenience, we recommend storing the data in a distributed file system such as HDFS; in this case, simply configure the data_path described above to use the hdfs:// scheme.

Other Useful Configurations

Extra configurations can be set as:

helm install [YOUR_RELEASE_NAME] graphscope/gie-standalone --set [key1]=[value1],[key2]=[value2]

Useful configuration keys are listed below:

| Name | Description | Default Value |
| --- | --- | --- |
| gremlinPort | the port for accessing the Gremlin service | 8182 |
| cypherPort | the port for accessing the Cypher service | 7687 |
| pegasusWorkerNum | the number of worker threads per executor | 2 |
| pegasusTimeout | the maximum duration in ms that each query is allowed to run | 24,000 |
| pegasusBatchSize | the maximum size of streaming records that can be output by an operator | 1024 |
| pegasusOutputCapacity | the maximum number of streaming records that can be output by an operator | 16 |
| frontendQueryPerSecondLimit | the maximum QPS (queries per second) that the frontend service can handle | 2147483647 (no limit) |
| gremlinScriptLanguageName | selects the Gremlin compiler according to the IR layer, either Traversal-based (‘antlr_gremlin_traversal’) or Calcite-based (‘antlr_gremlin_calcite’) | antlr_gremlin_traversal |
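When several keys are overridden at once, it can help to build the helm install command programmatically. The following is a minimal Python sketch under stated assumptions: the helper name helm_install_command, the release name my-gie, and the chosen override values are hypothetical, and only the key names come from this page. The sketch does not escape commas inside values, which Helm's --set syntax would require.

```python
import shlex
from typing import Dict, List, Union

Value = Union[str, int]


def helm_install_command(release: str, chart: str,
                         overrides: Dict[str, Value]) -> List[str]:
    """Build the argument list for `helm install` with `--set key=value,...`."""
    cmd = ["helm", "install", release, chart]
    if overrides:
        # Join all overrides into one comma-separated --set argument.
        pairs = ",".join(f"{key}={value}" for key, value in overrides.items())
        cmd += ["--set", pairs]
    return cmd


if __name__ == "__main__":
    cmd = helm_install_command(
        "my-gie",  # hypothetical release name
        "graphscope/gie-standalone",
        {"executor.replicaCount": 3, "pegasusWorkerNum": 4, "gremlinPort": 8182},
    )
    # Print a shell-safe command line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments unambiguous when they are passed to subprocess.run, while shlex.join produces a copy-pasteable form for the terminal.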