KaaS Support

Rackspace Kubernetes-as-a-Service

Last updated: Feb 11, 2022

Release: v6.0.2

This section is a quick reference for Rackspace support engineers who have questions about Rackspace Kubernetes-as-a-Service (KaaS). The section includes the information about monitoring, troubleshooting, and upgrading a Rackspace KaaS cluster.

Getting started with Kubernetes

KaaS is based on upstream Kubernetes, and many upstream concepts and operations apply to Kubernetes clusters deployed by KaaS. If you are new to Kubernetes, use these resources to get started:

Common customer support operations

This section includes information about customer support operations for Rackspace Kubernetes-as-a-Service (KaaS).

Create a monitoring suppression

Before performing maintenance, create a new suppression for the monitored environments:

  1. Log in to the Rackspace Business Automation (RBA) portal.
  2. Go to Event Management -> Suppression Manager 3.0 -> Schedule New Suppression.
  3. Fill in each of the tabs. Use the correct account, ticket, and device numbers.
  4. Confirm that the suppression is added to the maintenance ticket.

Resize a Kubernetes cluster

You can resize a Kubernetes cluster by adding or removing Kubernetes worker nodes, etcd nodes, or other nodes.

The kaasctl cluster scale command is under development. For emergency resize operations, use the procedure described in K8S-2052.

Replace a Kubernetes node

If one of the nodes in the Kubernetes cluster fails, you can replace it by using the kaasctl cluster replace-node command.

To repair a failed Kubernetes node, run the following commands:

  1. If you have not yet done so, start an interactive Docker session as described in Start an interactive Docker session.
  2. View the list of nodes in your cluster:
kaasctl cluster list-nodes <cluster-name>

Example:

kaasctl cluster list-nodes kubernetes-test

NodeID                                                  ProviderID                           Name                  Type  
openstack_compute_instance_v2.etcd.0                    2371c70b-d8e7-44e9-ab9b-6ff3b5d8cc7c kubernetes-test-etcd-1          etcd  
openstack_compute_instance_v2.etcd.1                    ed664934-0ae4-4532-83a9-305a2f9e1e7c kubernetes-test-etcd-2          etcd  
openstack_compute_instance_v2.etcd.2                    431aa4cc-10d2-46d2-be36-85ebaeb619e3 kubernetes-test-etcd-3          etcd  
openstack_compute_instance_v2.k8s_master_no_etcd.0      11b5c173-d5be-4bf4-9a2a-1bebbd9a70f8 kubernetes-test-k8s-master-ne-1 master  
openstack_compute_instance_v2.k8s_master_no_etcd.1      6fcaca4a-502d-40b9-bd7f-1e47a98458cb kubernetes-test-k8s-master-ne-2 master  
openstack_compute_instance_v2.k8s_master_no_etcd.2      73c4952d-b02a-4493-a188-592f6f56dd2b kubernetes-test-k8s-master-ne-3 master  
openstack_compute_instance_v2.k8s_node_no_floating_ip.0 ae106171-4ee1-4e2b-9478-bcdb242b8f01 kubernetes-test-k8s-node-nf-1   worker  
openstack_compute_instance_v2.k8s_node_no_floating_ip.1 c9a74608-27f3-4370-a2b6-5531a0b2a288 kubernetes-test-k8s-node-nf-2   worker  
openstack_compute_instance_v2.k8s_node_no_floating_ip.2 8473185a-43bf-472e-bc8a-bb93611b889f kubernetes-test-k8s-node-nf-3   worker  
openstack_compute_instance_v2.k8s_node_no_floating_ip.3 b0e7ed19-42ee-45ba-b371-1a70bffc4ed5 kubernetes-test-k8s-node-nf-4   worker
  1. Replace a failed node by using the NodeID from the output of the kaasctl cluster list-nodes command:
kaasctl cluster replace-node <cluster-name> <NodeID>

Example:

kaasctl cluster replace-node kubernetes-test openstack_compute_instance_v2.k8s_node_no_floating_ip.0

The resource openstack_compute_instance_v2.k8s_node_no_floating_ip.0 in the module root.compute has been marked as tainted!  
Initializing modules...

- module.network
- module.ips
- module.compute
- module.loadbalancer
- module.dns

Initializing provider plugins...
  1. When prompted, type yes.
Enter a value:Do you want to perform these actions?  
Terraform will perform the actions described above.  
Only 'yes' will be accepted to approve.

Enter a value:

This operation might take up to 15 minutes.

Example of system response:

...  
Gathering Facts ----------------------------------------------------- 3.33s  
fetch admin kubeconfig from Kube master ----------------------------- 2.18s  
fetch Kube CA key from Kube master ---------------------------------- 2.09s  
fetch Etcd CA cert from Kube master --------------------------------- 2.07s  
fetch Etcd client cert from Kube master ----------------------------- 2.05s  
fetch Kube CA cert from Kube master ----------------------------------2.05s  
fetch Etcd client key from Kube master -------------------------------1.95s  
write updated kubeconfig to file -------------------------------------0.82s  
download : Sync container --------------------------------------------0.80s  
download : Download items --------------------------------------------0.77s  
use apiserver lb domain name for server entries in kubeconfig --------0.46s  
read in fetched kubeconfig -------------------------------------------0.25s  
kubespray-defaults : Configure defaults ------------------------------0.16s  
download : include_tasks ---------------------------------------------0.10s
  1. Exit the interactive Docker session:
exit
  1. Verify the replaced node operation:
kubectl get nodes -o wide

kubectl get nodes -o wide  
NAME                              STATUS                     ROLES     AGE       VERSION   EXTERNAL-IP    OS-IMAGE                                        KERNEL-VERSION   CONTAINER-RUNTIME  
kubernetes-test-k8s-master-ne-1   Ready                      master    21h       v1.11.5   146.20.68.57   Container Linux by CoreOS 1855.4.0 (Rhyolite)   4.14.67-coreos   docker://18.6.1  
kubernetes-test-k8s-master-ne-2   Ready                      master    21h       v1.11.5   146.20.68.54   Container Linux by CoreOS 1855.4.0 (Rhyolite)   4.14.67-coreos   docker://18.6.1  
kubernetes-test-k8s-master-ne-3   Ready                      master    21h       v1.11.5   146.20.68.76   Container Linux by CoreOS 1855.4.0 (Rhyolite)   4.14.67-coreos   docker://18.6.1  
kubernetes-test-k8s-node-nf-1     Ready                      node      21h       v1.11.5   <none>         Container Linux by CoreOS 1855.4.0 (Rhyolite)   4.14.67-coreos   docker://18.6.1  
kubernetes-test-k8s-node-nf-2     Ready                      node      21h       v1.11.5   <none>         Container Linux by CoreOS 1855.4.0 (Rhyolite)   4.14.67-coreos   docker://18.6.1  
kubernetes-test-k8s-node-nf-3     Ready                      node      21h       v1.11.5   <none>         Container Linux by CoreOS 1855.4.0 (Rhyolite)   4.14.67-coreos   docker://18.6.1  
kubernetes-test-k8s-node-nf-4     Ready                      node      21h       v1.11.5   <none>         Container Linux by CoreOS 1855.4.0 (Rhyolite)   4.14.67-coreos   docker://18.6.1

NOTE: If you want to replace more than one node, you need to run this command for each node.

Replace a master node

Master node replacement is currently not implemented.

Replace an etcd node

If one of the nodes in the etcd cluster fails, you can mark it as unhealthy by using the Terraform taint command, replace the node, and then rerun the deployment of the unhealthy components. First, identify the node that needs to be replaced and then run the taint command so that Terraform applies the necessary changes to the cluster.

To repair an unhealthy etcd node, run the following commands:

  1. If you have not yet done so, start an interactive Docker session as described in Start an interactive Docker session.

  2. Change the current directory to the directory with the terraform.tfstate file. Typically, it is located in /<provider-dir>/clusters/<cluster-name>.

  3. Run the terraform taint command:

terraform taint -module='compute' '<etcd-node-name>'

Example for an OpenStack environment:

terraform taint -module='compute' 'openstack_compute_instance_v2.etcd.4'
  1. Redeploy the infrastructure:
kaasctl cluster create <cluster-name> --infra-only
  1. Verify the nodes that kaasctl replaces in the output and, when prompted, type yes.
  2. Remove the failed etcd member from the cluster:

i. Connect to the etcd master node by using SSH. Use the id_rsa_core key stored in <provider-dir>/clusters/<cluster-name>.

ii. Change the directory to /etc/kubernetes/ssl/etcd/.

iii. Get the list of endpoints:

 ps -ef | grep etcd 

iv. Get the status of the endpoints:

ETCDCTL_API=3 etcdctl --endpoints <list-of-endpoints> --cacert="ca.pem" --cert="<cert-name>.pem" --key="<key>.pem" endpoint health

Example:

ETCDCTL_API=3 etcdctl --endpoints https://10.0.0.6:2379,https://10.0.0.13:2379,https://10.0.0.14:2379,https://10.0.0.9:2379,https://10.0.0.7:2379 --cacert="ca.pem" --cert="node-kubernetes-test-k8s-master-ne-1.pem" --key="node-kubernetes-test-k8s-master-ne-1-key.pem" endpoint health

v. Get the list of cluster members:

ETCDCTL_API=3 etcdctl --endpoints <list-of-endpoints> --cacert="ca.pem" --cert="<cert-name>.pem" --key="<key>.pem" member list

Example:

ETCDCTL_API=3 etcdctl --endpoints https://10.0.0.6:2379,https://10.0.0.13:2379,https://10.0.0.14:2379,https://10.0.0.9:2379,https://10.0.0.7:2379  --cacert="ca.pem" --cert="node-kubernetes-test-k8s-master-ne-1.pem" --key="node-kubernetes-test-k8s-master-ne-1-key.pem" member list

Correlate the IP of the unhealthy endpoint above with the correct member.

vi. Remove the unhealthy etc member by using the hash for the correct member:

ETCDCTL_API=3 etcdctl --endpoints <list-of-endpoints> --cacert="ca.pem" --cert="<cert-name>.pem" --key="<key>.pem" member remove <hash>

Example:

ETCDCTL_API=3 etcdctl --endpoints https://10.0.0.6:2379,https://10.0.0.13:2379,https://10.0.0.14:2379,https://10.0.0.9:2379,https://10.0.0.7:2379 --cacert="ca.pem" --cert="node-kubernetes-test-k8s-master-ne-1.pem" --key="node-kubernetes-test-k8s-master-ne-1-key.pem" member remove 503e0d1f76136e08
  1. Terminate the connection to the master node.
  2. Start the kaasctl Docker interactive session.
  3. Recreate the cluster by running:
kaasctl cluster create <cluster-name> --skip-infra
  1. Log in to the master node.
  2. Verify the etcd node status by running the etcdctl endpoint health and etcdctl member list commands.

NOTE: If you want to replace more than one etcd node, you must perform this full procedure for each node.

Replace a load balancer in OpenStack

By default, Rackspace KaaS deploys the following load balancers:

  • The Kubernetes API (deployed by using Terraform - all others are deployed by using Kubernetes)
  • Ingress Controller
  • The Docker registry

All load balancers that are deployed outside of the rackspace-system namespace are managed by the customer.

If a load balancer fails or is in an unrecoverable state, use the openstack loadbalancer failover <lb-id> to replace it. This command works for OpenStack Queens or later.

To replace a load balancer, complete the following steps:

  1. Replace a load balancer:
openstack loadbalancer failover <lb-id>
  1. Optionally, verify the Kubernetes operation by using kubectl:
kubectl get nodes

The command returns a list of Kubernetes nodes.

Update the rpc-environments repository

Every time you make a change to the Terraform environment, such as replacing a node or a load balancer, you must update the rpc-environments repository with the new Terraform *.tfstate file.

To update the *.tfstate file in the rpc-environments repository:

  1. Copy the <provider-dir>/clusters/<cluster-name>/terraform-<cluster>.tfstate to /tmp.
  2. Find the customer’s vault encryption password in the PasswordSafe project.
  3. Encrypt the *.tfstate file copy in /tmp with ansible-vault:
root@infra01:/tmp$ ansible-vault encrypt /tmp/terraform-my-cluster.tfstate  
New Vault password:  
Confirm New Vault password:  
Encryption successful
  1. Create a PR in the customer’s project in the https://github.com/rpc-environments/<customer-id> repository with the updated file.

Network configuration

This section describes some of the KaaS networking concepts, such as network policies, traffic flow, and so on.

Configure network policies

A network policy is a specification of how groups of pods are allowed to communicate with each other and other network endpoints. Network policies are supported in Kubernetes with certain Container Network Interface (CNI) providers, such as Calico and Canal.

By default, Kubernetes clusters are shipped with the Calico CNI provider, which implements network policies.

Calico pods must be running when you execute the following kubectl command:

kubectl -n kube-system get pods | grep calico  
kube-calico-gvqpd                                 2/2       Running   0          1h  
kube-calico-hkwph                                 2/2       Running   0          1h  
kube-calico-hp8hv                                 2/2       Running   0          1h  
kube-calico-jlqxg                                 2/2       Running   0          1h  
kube-calico-p0kl9                                 2/2       Running   0          1h  
kube-calico-pkf1f                                 2/2       Running   0          1h

Verify a network policy

By default, Calico allows all ingress and egress traffic to go in and out of Pods. To see network policies in action, follow the instructions in this demo guide. In addition, see Configure network policies.

Network traffic flow in a worker node

During the deployment, the Kubernetes Installer creates many network ports on the Kubernetes worker node. The following table describes these ports and their function:

Worker node networking ports

Network interfaceDescription
calicNNNPorts created by Calico for Kubernetes pods. Every time the Kubernetes Installer or you create a pod, Calico creates a calicNNN network interface and connects it to the container by using the Container Network Interface (CNI) plugin.
docker0An interface that is created by default during the cluster deployment, but is not used for any traffic.
kube-ipvs0An IP Virtual Server (IPVS) interface that kube-proxy uses to distribute traffic through the Linux Virtual Server (LVS) kernel module.
eth0An Ethernet interface that the worker node uses to connect to other nodes in the Kubernetes cluster, OpenStack network, and the external network.

KaaS deploys kube-calico pods that are responsible for network requests management on each worker node. kube-calico is responsible for setting up a Border Gateway Protocol (BGP) mesh between each node while also being responsible for applying related network policies.

The following diagram describes traffic flow from the pod’s network interface to the eth0:

Traffic

kube-calico configures routing and other important networking settings. In the diagram above, network traffic goes from the pod’s caliNNN interface to kube-calico, which applies network policies, such as blocking certain network ports and so on. From kube-calico, the traffic goes straight to eth0 and is routed as the next hop to Calico on the destination host.

Another way of processing traffic inside a worker is by using kube-proxy. kuber-proxy acts as an internal load balancer and processes network traffic for Kubernetes services with Type=NodePorts, such as MySQL, Reddis, and so on.

The following diagram describes traffic flow from the eth0 to kube-proxy.

Traffic

In the diagram above, kube-proxy receives traffic and sends it either to a pod on this worker node or back to eth0 and to another worker node that has other replicas of that pod to process the request.

For more information about how kube-proxy uses LVS and IPVS, see IPVS-Based In-Cluster Load Balancing Deep Dive.

Routing table

Calico is responsible for delivering IP packets to and from Kubernetes pods. When a pod sends an Address Resolution Protocol (ARP) request, Calico is always the next IP hop on the packet delivery journey. Calico applies the iptables rules to the request and then sends them to the other pod directly, kube-proxy, or sends them outside the Kubernetes cluster through the tenant network. The routing information is stored in a Linux routing table on the Calico worker node.

The following text is an example of a routing table on a worker node:

$ route
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:bf:cb:fc brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.6/24 brd 10.0.0.255 scope global dynamic eth0
       valid_lft 76190sec preferred_lft 76190sec
    inet6 fe80::f816:3eff:febf:cbfc/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:48:7e:36:f2 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:48ff:fe7e:36f2/64 scope link
       valid_lft forever preferred_lft forever
6: kube-ipvs0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 76:42:b6:e4:24:cf brd ff:ff:ff:ff:ff:ff
    inet 10.3.0.1/32 brd 10.3.0.1 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.0.3/32 brd 10.3.0.3 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.55.252/32 brd 10.3.55.252 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.248.85/32 brd 10.3.248.85 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 172.99.65.142/32 brd 172.99.65.142 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.125.227/32 brd 10.3.125.227 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.201.201/32 brd 10.3.201.201 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.92.5/32 brd 10.3.92.5 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.169.148/32 brd 10.3.169.148 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.131.23/32 brd 10.3.131.23 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.202.25/32 brd 10.3.202.25 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.232.245/32 brd 10.3.232.245 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.249.193/32 brd 10.3.249.193 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.198.106/32 brd 10.3.198.106 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.81.64/32 brd 10.3.81.64 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.66.106/32 brd 10.3.66.106 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.121.31/32 brd 10.3.121.31 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.13.82/32 brd 10.3.13.82 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.156.254/32 brd 10.3.156.254 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.179.85/32 brd 10.3.179.85 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.20.113/32 brd 10.3.20.113 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.94.143/32 brd 10.3.94.143 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.188.170/32 brd 10.3.188.170 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.10.65/32 brd 10.3.10.65 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.171.165/32 brd 10.3.171.165 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.141.154/32 brd 10.3.141.154 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.159.18/32 brd 10.3.159.18 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.35.42/32 brd 10.3.35.42 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.248.76/32 brd 10.3.248.76 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.133.73/32 brd 10.3.133.73 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.65.201/32 brd 10.3.65.201 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.118.116/32 brd 10.3.118.116 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.82.120/32 brd 10.3.82.120 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.223.10/32 brd 10.3.223.10 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.52.96/32 brd 10.3.52.96 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet6 fe80::7442:b6ff:fee4:24cf/64 scope link
       valid_lft forever preferred_lft forever
9: caliac52d399c5c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
10: cali94f5a40346e@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
11: calif9c60118c9d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
12: cali12440b20ff9@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
13: cali646ffb1b67d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
14: cali315a9d1eed9@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 5
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever
15: cali3913b4bb0e1@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 6
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link
       valid_lft forever preferred_lft forever

All internal tenant network traffic is served inside of the 10.0.0.0 network (kube-proxy) through the kube-ipvs0 interface. kube-proxy creates IPVSs for each Service IP address respectively.

In the example above, Calico uses the 169.254.169.254 gateway to apply its iptables rules.

The following text is an example of a routing table on a pod:

/ # route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         169.254.1.1     0.0.0.0         UG    0      0        0 eth0
169.254.1.1     *               255.255.255.255 UH    0      0        0 eth0

In the example above,169.254.1.1 is the Calico gateway.

Analyze IP packets using tcpdump

You can use the tcpdump tool to analyze IP packets transmitted to and from pods on a Kubernetes node.

To use tcpdump, complete the following steps:

  1. If you want to generate network traffic for testing purposes, install the net-tools probe on your node:
apiVersion: apps/v1
kind: Deployment
metadata:x
  name: net-tools-deployment
  labels:
    app: net-tools
spec:
  replicas: 1
  selector:
    matchLabels:
      app: net-tools
  template:
    metadata:
      labels:
        app: net-tools
    spec:
      containers:
      - name: nettools
        image: raesene/alpine-nettools
  1. Determine the worker node on which the pod that you want to monitor is running.
  2. Log in to the pod:

Example:

(kubernetes-xgerman/rackspace-system) installer $ kubectl -it exec prometheus-k8s-1  /bin/sh

Example of system response:

Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-1' to see all of the containers in this pod.

In the example above, we log in to a Prometheus pod.

  1. List all IP addresses on the pod’s network interface:
/prometheus $ ip a

Example of system response:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1400 qdisc noqueue
    link/ether 22:77:54:21:7c:0f brd ff:ff:ff:ff:ff:ff
    inet 10.2.5.10/32 scope global eth0
       valid_lft forever preferred_lft forever
  1. Remember the network address for eth0.
  2. Ping . he pod from the probe.

Example:

5676-c4h6z /bin/sh
/ # ping 10.2.5.10

Example of system response:

PING 10.2.5.10 (10.2.5.10): 56 data bytes
64 bytes from 10.2.5.10: seq=0 ttl=62 time=1.725 ms
64 bytes from 10.2.5.10: seq=1 ttl=62 time=1.250 ms
64 bytes from 10.2.5.10: seq=2 ttl=62 time=1.122 ms
64 bytes from 10.2.5.10: seq=3 ttl=62 time=1.369 ms
64 bytes from 10.2.5.10: seq=4 ttl=62 time=0.771 ms
64 bytes from 10.2.5.10: seq=5 ttl=62 time=0.725 ms
  1. Log in to the worker node.
  2. Validate that traffic gets to the node:
sudo tcpdump -i eth0 icmp

Example of system response:

dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
21:35:34.662425 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 60, length 64
21:35:34.662483 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 60, length 64
21:35:35.662860 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 61, length 64
21:35:35.663682 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 61, length 64
21:35:36.663004 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 62, length 64
21:35:36.663086 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 62, length 64
21:35:37.663531 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 63, length 64
21:35:37.663596 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 63, length 64
21:35:38.663694 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 64, length 64
21:35:38.663784 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 64, length 64
21:35:39.663464 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 65, length 64
21:35:39.663556 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 65, length 64
21:35:40.664055 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 66, length 64
21:35:40.664141 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 66, length 64
  1. Get the name of the pod (caliNNN) network interface by its IP address:

Example:

route | grep 10.2.5.10

Example of system response:

10.2.5.10       0.0.0.0         255.255.255.255 UH    0      0        0 calie2afdf225c0
  1. Validate the traffic on the pod’s (caliNNN) network interface:

    Example:

   sudo tcpdump -i calie2afdf225c0 icmp

Example of system response:

dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on calie2afdf225c0, link-type EN10MB (Ethernet), capture size 262144 bytes
21:37:36.693484 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 182, length 64
21:37:36.693544 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 182, length 64
21:37:37.693764 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 183, length 64
21:37:37.693802 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 183, length 64
21:37:38.693562 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 184, length 64
21:37:38.693601 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 184, length 64
21:37:39.693902 IP 10.2.3.15 > 10.2.5.10: ICMP echo request, id 11520, seq 185, length 64
21:37:39.693943 IP 10.2.5.10 > 10.2.3.15: ICMP echo reply, id 11520, seq 185, length 64

Add resource limits to applications

Resource limits are a critical part of a production deployment. Without defined limits, Kubernetes overschedules workloads, leading to “noisy neighbor” problems. Such behavior is particularly troublesome for workloads that come in bursts, which might expand to use all of the available resources on a Kubernetes worker node.

Adding resource limits

Kubernetes has the following classes of resource constraints: limit and request. The request attribute is a soft limit, which might be exceeded if extra resources are available. The limit attribute is a hard limit, which prevents scheduling if worker nodes do not have enough resources available. The limit resource might cause termination of an application if the application attempts to exceed the configured cap.

These limits are specified in the container specification.

Example:

apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: db
    image: mysql
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Determining resource limits

Profiling actual usage is the best way to set resource limits. Deploy the application without setting any resource limits, run it through a typical set of tasks, and then examine the resource usage in the monitoring dashboard. You might want to set limits somewhat higher than the observed usage to ensure optimal provisioning. Limits can later be tuned downward if necessary.

Troubleshooting

This section describes how to troubleshoot issues with your Kubernetes cluster, managed services, and underlying components of the Rackspace KaaS solution.

Basic troubleshooting

Maintainers

  • Shane Cunningham (@shanec)

To get help from the maintainers, contact them in #kaas by their Slack name.

Kubernetes

Connect to a Kubernetes node

You can access the Kubernetes master nodes by using floating IPs (FIPs).

  1. To discover the FIPs for the Kubernetes master nodes, set the OpenStack tooling aliases.
  2. List your OpenStack servers:
$ openstack server list -c Name -c Status -c Networks
+--------------------------+--------+--------------------------------------------------+
| Name                     | Status | Networks                                         |
+--------------------------+--------+--------------------------------------------------+
| etoews-rpc1-iad-master-0 | ACTIVE | etoews-rpc1-iad_network=10.0.0.14, 172.99.77.130 |
| etoews-rpc1-iad-worker-0 | ACTIVE | etoews-rpc1-iad_network=10.0.0.6                 |
| etoews-rpc1-iad-worker-1 | ACTIVE | etoews-rpc1-iad_network=10.0.0.9                 |
| etoews-rpc1-iad-master-2 | ACTIVE | etoews-rpc1-iad_network=10.0.0.7, 172.99.77.107  |
| etoews-rpc1-iad-master-1 | ACTIVE | etoews-rpc1-iad_network=10.0.0.12, 172.99.77.100 |
| etoews-rpc1-iad-worker-2 | ACTIVE | etoews-rpc1-iad_network=10.0.0.10                |
+--------------------------+--------+--------------------------------------------------+
  1. Export the SSH_OPTS and MASTER_FIP variables:
$ export SSH_OPTS="-i clusters/${K8S_CLUSTER_NAME}/id_rsa_core -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"

$ export MASTER_FIP=172.99.77.130
  1. Connect to the Kubernetes master node:
$ ssh ${SSH_OPTS} core@${MASTER_FIP}
  1. Get the kubelet logs:
core@clustername-master-0 ~ $ journalctl -u kubelet

By default, Kubernetes worker nodes are not publicly available as they do not have an assigned floating IP address (FIP). To connect to a Kubernetes worker node, first copy the SSH private key located in clusters/*clustername*/id_rsa on the machine from which you deployed the Kubernetes cluster (your laptop or an OpenStack infrastructure node) to a Kubernetes master node and all worker nodes. Then, you can connect to a Kubernetes worker node from the Kubernetes master node.

$ scp ${SSH_OPTS} clusters/${K8S_CLUSTER_NAME}/id_rsa_core core@${MASTER_FIP}:.

$ ssh ${SSH_OPTS} core@${MASTER_FIP}

core@clustername-master-0 ~ $ ssh -i id_rsa_core core@<worker-ip>

core@clustername-worker-0 ~ $ journalctl -u kubelet
Diagnose a Kubernetes cluster

You can use the cluster-diagnostics.sh script to get the information about a Kubernetes cluster.

hack/support/cluster-diagnostics.sh

To simplify log gathering, we recommend that you install this gist CLI.

To use the gist CLI, run the following commands:

gist --login

gist --private --copy clusters/${K8S_CLUSTER_NAME}/logs/*.log

The command above uploads the logs to a private. gist and copies the gist URL to your clipboard.

If the Kubernetes API is reachable, the script uses heptio/sonobuoy to gather diagnostics about the cluster. The collected data is accumulated in the clusters/$K8S_CLUSTER_NAME/logs/sonobuoy-diagnostics.$(date +%Y-%m-%dT%H:%M:%S).tar.gz archive. For more information about the contents of the tarball, see the Sonobuoy documentation.

The script uploads the cluster-diagnostics, sonobuoy-diagnostics, and kubernetes-installer logs and tarballs to Cloud Files. However, for this to work you need to configure the K8S_CF_USER and K8S_CF_APIKEY variables from PasswordSafe. You might also need to pull the latest OpenStack CLI client image.

Known Issues

Warning, FailedMount, attachdetach-controller, AttachVolume.Attach, Invalid request due to incorrect syntax or missing required parameters.

An issue with attachdetach-controller might result in the following error message:

kubectl get events --all-namespaces
NAMESPACE          LAST SEEN   FIRST SEEN   COUNT     NAME                                                             KIND      SUBOBJECT   TYPE      REASON        SOURCE                    MESSAGE
monitoring         32m         1d           72        prometheus-customer-0.15333bb2a31a0e4f                           Pod                   Warning   FailedMount   attachdetach-controller   AttachVolume.Attach failed for volume "pvc-c29aceb0-6385-11e8-abfd-fa163eea581f" : Invalid request due to incorrect syntax or missing required parameters.

If you see a message like the one described above in the event log while the volume is bound to a node and mounted to a pod or container, then your Kubernetes cluster might have other issues unrelated to the volume. For more information, see K8S-1105.

Managed services

Image registry

The image registry is implemented using VMware Harbor. For more information, see the VMware Harbor documentation.

Harbor is a subsystem made of highly-coupled microservices. We recommend that you read Harbor architecture overview to understand the product better.

Harbor’s architecture has evolved since the documentation was written and it is still in transition. VMware is adding and changing services constantly and it might take some time until the final architecture is implemented. You might want to read the updated architecture overview PR that can help you understand some of the newer components and all of the coupling.

See also:

Image registry database

To access the image registry database, run the following commands:

$ REGISTRY_MYSQL_POD_ID=$(kubectl get pods -n rackspace-system | grep registry-mysql | awk '{print $1}')
$ kubectl exec ${REGISTRY_MYSQL_POD_ID} -it -n rackspace-system -- bash
# mysql --user=root --password=$MYSQL_ROOT_PASSWORD --database=registry
mysql> show tables;
mysql> select * from alembic_version;
Elasticsearch

Index pruning

To debug the index pruning, find the latest job run by using the following commands:

$ kubectl get jobs | grep curator
es-purge-1497635820   1         0            3m
$ job=curator-1497635820
$ pods=$(kubectl get pods --selector=job-name=${job} --output=jsonpath={.items..metadata.name})
$ kubectl logs $pods

OpenStack

Connect to an OpenStack environment data plane

Complete the following steps:

  1. Connect to the VPN.

  2. Choose the OpenStack env from one of the <env>.<region>.ohthree.com tabs in the Managed Kubernetes PasswordSafe and find the 10.x.x.x IP and root password.

  3. Connect to the control-1 node of the OpenStack control plane:

ssh [email protected]
  1. View each of the control plane services by typing the following commands:
lxc-ls
  1. Connect to the required service by SSH using the service name. For example:
ssh control-1_cinder_api_container-972603b0
  1. View the service logs in /var/log/<servicename>/. For example:
ls /var/log/cinder/
OpenStack tooling

Set up the aliases that can run the OpenStack client to help you troubleshoot your RPC environment using the following commands:

alias osc='docker run -it --rm --volume ${PWD}:/data --env-file <(env | grep OS_) --env PYTHONWARNINGS="ignore:Unverified HTTPS request" quay.io/rackspace/openstack-cli'
alias openstack='osc openstack --insecure'
alias neutron='osc neutron --insecure'

You can use the following aliases:

openstack server list  
neutron net-list
OpenStack monitoring

To view the monitoring information for a particular RPCO environment, follow these steps:

  1. Log in to https://monitoring.rackspace.net using your Rackspace SSO.
  2. Search for the account environment ID. For example: rpc1.kubernetes.iad.ohthree.com: 4958307
  3. Click on Rackspace Intelligence in the navigation bar to access the Cloud Monitoring. You can view checks without leaving to Rackspace Intelligence.
Delete OpenStack resources

Ocasionally, when you delete a Kubernetes cluster, it might result in orphaned resources. To remove these resources, use the following options:

  • Delete. Use this command, if you cannot connect to the kubernetes API server.
kaasctl cluster delete <cluster-name>
  • OpenStack Dashboard. You can try to delete Kubernetes resources using the Horizon UI. However, its functionality might be limited.
Map OpenStack servers to physical hosts

  1. Log in to the RPC environment using the admin credentials from one of the *.kubernetes.ohthree.com tabs in the Managed Kubernetes PasswordSafe.
  2. In the left hand navigation menu, select Admin > Instances.
  3. In the upper-right corner, filter the instances using Name = from the dropdown. The physical host is listed in the Host column.
Out of Memory, Kubenetes node is in NotReady status

If you found a node that is in the NotReady status, you can analyze the log file of this machine for potential problems by performing these steps:

  1. Log in to the OpenStack Horizon UI.
  2. Go to Project -> Compute -> Instances.
  3. Select the instance name -> Log -> View full log.
    • If you see Memory cgroup out of memory in the log file, this might be the reason of the node in NotReady status. Reboot the node.
RPC dashboards

The following dashboards deployed in rpc1.kubernetes.iad.ohthree.com might help you troubleshoot issues with your Kubernetes cluster:

Troubleshooting Rackspace KaaS

Table of Contents

DNS

Rackspace Kubernetes-as-a-Service (KaaS) uses OpenStack DNS-as-a-Service (Designate) that provides REST APIs for record management, as well as integrates with the OpenStack Identity service (keystone). When Kubernetes Installer creates Kubernetes nodes, it adds DNS records about the nodes in the Designate DNS zone. Kubernetes Installer also adds records for Ingress Controller, Docker registry, and the Kubernetes API endpoints.

You can use the following commands to retrieve information about your Kubernetes cluster:

  • Use the host lookup utility to check the IP address of your Kubernetes master node:
 $ host shanec-cluster-master-0.pug.systems  
shanec-cluster-master-0.pug.systems has address 148.62.13.55
  • Connect to the Kubernetes master node using SSH:
$ ssh -i id_rsa_core [email protected]  
Last login: Fri Jun 23 16:49:09 UTC 2017 from 172.99.99.10 on pts/0  
Container Linux by CoreOS beta (1409.1.0)  
Update Strategy: No Reboots  
core@shanec-cluster-master-0 ~ $

Each cluster has its own SSH key credentials in PasswordSafe. The default user in Container Linux is core.

You can apply this operation to all Kubernetes master nodes in the cluster:

$ kubectl get nodes  
NAME                      STATUS    AGE       VERSION  
shanec-cluster-master-0   Ready     31m       v1.6.4+coreos.0  
shanec-cluster-master-1   Ready     31m       v1.6.4+coreos.0  
shanec-cluster-master-2   Ready     31m       v1.6.4+coreos.0  
shanec-cluster-worker-0   Ready     31m       v1.6.4+coreos.0  
shanec-cluster-worker-1   Ready     31m       v1.6.4+coreos.0  
shanec-cluster-worker-2   Ready     31m       v1.6.4+coreos.0

Logs

The Rackspace KaaS offering deploys Elasticsearch, Fluentd, and Kibana by default. Also, when troubleshooting pods, you can use the kubectl logs command to inspect pod level logs.

Access Kibana information

Kibana visualizes information about your Kubernetes clusters that is collected and stored in Elasticsearch.

You can use the following commands to retrieve information about your Kubernetes cluster provided by Kibana:

  • Access information about Kibana deployments in all namespaces:
$ kubectl get all --all-namespaces -l k8s-app=kibana  
NAMESPACE          NAME                          READY     STATUS    RESTARTS   AGE  
rackspace-system   pod/kibana-76c4c44bcb-nmspf   1/1       Running   0          15h

NAMESPACE          NAME           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE  
rackspace-system   service/logs   ClusterIP   10.3.160.207   <none>        5601/TCP   15h

NAMESPACE          NAME                           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
rackspace-system   deployment.extensions/kibana   1         1         1            1           15h

NAMESPACE          NAME                                      DESIRED   CURRENT   READY     AGE  
rackspace-system   replicaset.extensions/kibana-76c4c44bcb   1         1         1         15h

NAMESPACE          NAME                     DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
rackspace-system   deployment.apps/kibana   1         1         1            1           15h

NAMESPACE          NAME                                DESIRED   CURRENT   READY     AGE  
rackspace-system   replicaset.apps/kibana-76c4c44bcb   1         1         1         15h
  • By default, Kibana is configured with an ingress resource that enables inbound connections reach Kubernetes services. Each Kubernetes cluster has an ingress fully qualified domain name (FQDN) at kibana.${K8S_CLUSTER_NAME}.pug.systems. You can view information about ingress resources by running the following command:
$ kubectl get ingress --all-namespaces  
NAMESPACE          NAME         HOSTS     ADDRESS   PORTS     AGE  
monitoring         prometheus   *                   80        15h  
rackspace-system   grafana      *                   80        15h  
rackspace-system   kibana       *                   80        15h  
rackspace-system   prometheus   *                   80        15h

View information about the Kibana ingress resource by running the following command:

$ kubectl describe ingress -n rackspace-system kibana

System response:

Name:             kibana
Namespace:        rackspace-system
Address:
Default backend:  default-http-backend:80 (<none>)
Rules:
  Host  Path  Backends
  ----  ----  --------
  *
        /logs   logs:5601 (<none>)
Annotations:
  nginx.ingress.kubernetes.io/auth-type:             basic
  nginx.ingress.kubernetes.io/rewrite-target:        /
  nginx.ingress.kubernetes.io/ssl-redirect:          false
  kubectl.kubernetes.io/last-applied-configuration:  {"apiVersion":"extensions/v1beta1","kind":"Ingress","metadata":{"annotations":{"nginx.ingress.kubernetes.io/auth-realm":"Kibana Basic Authentication","nginx.ingress.kubernetes.io/auth-secret":"basic-auth-access","nginx.ingress.kubernetes.io/auth-type":"basic","nginx.ingress.kubernetes.io/rewrite-target":"/","nginx.ingress.kubernetes.io/ssl-redirect":"false"},"labels":{"k8s-app":"kibana"},"name":"kibana","namespace":"rackspace-system"},"spec":{"rules":[{"http":{"paths":[{"backend":{"serviceName":"logs","servicePort":5601},"path":"/logs"}]}}]}}

  nginx.ingress.kubernetes.io/auth-realm:   Kibana Basic Authentication
  nginx.ingress.kubernetes.io/auth-secret:  basic-auth-access
Events:                                     <none>
  • Access the Kibana web user interface (WUI) at https://kibana.${DOMAIN}. Use the credentials stored in PasswordSafe.
Analyze pod logs

Sometimes issues occur with specific pods. For example, a pod might be in a restart loop because the health check is failing. You can check pod logs in Kibana or you can check pod logs from the Kubernetes cluster directly.

To analyze pod issues, use the following instructions:

  1. Get information about pods:
kubectl get pods

For example, you have the following pod with errors:

rackspace-system   po/elasticsearch-3003189550-9k8nz      0/1   Error   26    4h
  1. Search the log file for the error information about this pod:
$ kubectl logs po/elasticsearch-3003189550-9k8nz -n rackspace-system

System response:

[2017-06-21T23:35:15,680][INFO ][o.e.n.Node               ] [] initializing ...
[2017-06-21T23:35:15,826][INFO ][o.e.e.NodeEnvironment    ] [f_DBfoM] using [1] data paths, mounts [[/ (overlay)]], net usable_space [86.3gb], net total_space [94.5gb], spins? [unknown], types [overlay]
[2017-06-21T23:35:15,827][INFO ][o.e.e.NodeEnvironment    ] [f_DBfoM] heap size [371.2mb], compressed ordinary object pointers [true]
[2017-06-21T23:35:15,830][INFO ][o.e.n.Node               ] node name [f_DBfoM] derived from node ID [f_DBfoMATZS89ap-852SWQ]; set [node.name] to override
[2017-06-21T23:35:15,830][INFO ][o.e.n.Node               ] version[5.3.0], pid[1], build[3adb13b/2017-03-23T03:31:50.652Z], OS[Linux/4.11.6-coreos/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_92-internal/25.92-b14]
[2017-06-21T23:35:19,067][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [aggs-matrix-stats]
[2017-06-21T23:35:19,067][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [ingest-common]
[2017-06-21T23:35:19,068][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [lang-expression]
[2017-06-21T23:35:19,068][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [lang-groovy]
[2017-06-21T23:35:19,068][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [lang-mustache]
[2017-06-21T23:35:19,068][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [lang-painless]
[2017-06-21T23:35:19,069][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [percolator]
[2017-06-21T23:35:19,069][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [reindex]
[2017-06-21T23:35:19,069][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [transport-netty3]
[2017-06-21T23:35:19,069][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded module [transport-netty4]
[2017-06-21T23:35:19,070][INFO ][o.e.p.PluginsService     ] [f_DBfoM] loaded plugin [x-pack]
[2017-06-21T23:35:24,974][INFO ][o.e.n.Node               ] initialized
[2017-06-21T23:35:24,975][INFO ][o.e.n.Node               ] [f_DBfoM] starting ...
[2017-06-21T23:35:25,393][WARN ][i.n.u.i.MacAddressUtil   ] Failed to find a usable hardware address from the network interfaces; using random bytes: 80:8d:a8:63:81:52:00:ce
[2017-06-21T23:35:25,518][INFO ][o.e.t.TransportService   ] [f_DBfoM] publish_address {10.2.4.10:9300}, bound_addresses {[::]:9300}
[2017-06-21T23:35:25,531][INFO ][o.e.b.BootstrapChecks    ] [f_DBfoM] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
ERROR: bootstrap checks failed
max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
[2017-06-21T23:35:25,575][INFO ][o.e.n.Node               ] [f_DBfoM] stopping ...
[2017-06-21T23:35:25,665][INFO ][o.e.n.Node               ] [f_DBfoM] stopped
[2017-06-21T23:35:25,666][INFO ][o.e.n.Node               ] [f_DBfoM] closing ...
[2017-06-21T23:35:25,691][INFO ][o.e.n.Node               ] [f_DBfoM] closed

In the output above, the pod is failing because of the following error:

ERROR: bootstrap checks failed max virtual memory areas vm.max_map_count
[65530] is too low, increase to at least [262144].

When you see this error, you might need to investigate the deployment or pod YAML.

  1. View the Ingress Controller pod log tail:
$ kubectl logs nginx-ingress-controller-294216488-ncfq9 -n rackspace-system --tail=3
::ffff:172.99.99.10 - [::ffff:172.99.99.10] - - [23/Jun/2017:18:08:18 +0000] "POST /logs/api/monitoring/v1/clusters HTTP/2.0" 200 974 "https://148.62.13.65/logs/app/monitoring" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Firefox/45.0" 117 0.124 [kube-system-logs-5601] 10.2.0.3:5601 751 0.124 200"
::ffff:172.99.99.10 - [::ffff:172.99.99.10] - - [23/Jun/2017:18:08:33 +0000] "GET /logs/api/reporting/jobs/list_completed_since?since=2017-06-23T17:26:26.119Z HTTP/2.0" 200 262 "https://148.62.13.65/logs/app/monitoring" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Firefox/45.0" 66 0.024 [kube-system-logs-5601] 10.2.0.3:5601 37 0.024 200"
::ffff:172.99.99.10 - [::ffff:172.99.99.10] - - [23/Jun/2017:18:08:33 +0000] "POST /logs/api/monitoring/v1/clusters HTTP/2.0" 200 974 "https://148.62.13.65/logs/app/monitoring" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Firefox/45.0" 117 0.160 [kube-system-logs-5601] 10.2.0.3:5601 751 0.160 200"
  1. To get detailed output, run the kubectl logs command in the verbose mode:
$ kubectl logs nginx-ingress-controller-294216488-ncfq9 -n rackspace-system --tail=3 --v=99

System response

I0623 13:10:51.867425   35338 loader.go:354] Config loaded from file /Users/shan5490/code/kubernetes/managed-kubernetes/kubernetes-installer-myfork/clusters/shanec-cluster/generated/auth/kubeconfig
I0623 13:10:51.868737   35338 cached_discovery.go:118] returning cached discovery info from /Users/shan5490/.kube/cache/discovery/shanec_cluster_k8s.pug.systems_443/servergroups.json
I0623 13:10:51.874469   35338 cached_discovery.go:118] returning cached discovery info from /Users/shan5490/.kube/cache/discovery/shanec_cluster_k8s.pug.systems_443/servergroups.json

...

I0623 13:10:51.874602   35338 round_trippers.go:398] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.6.6 (darwin/amd64) kubernetes/7fa1c17" https://shanec-cluster-k8s.pug.systems:443/api/v1/namespaces/kube-system/pods/nginx-ingress-controller-294216488-ncfq9
I0623 13:10:52.044683   35338 round_trippers.go:417] GET https://shanec-cluster-k8s.pug.systems:443/api/v1/namespaces/kube-system/pods/nginx-ingress-controller-294216488-ncfq9 200 OK in 170 milliseconds
I0623 13:10:52.044731   35338 round_trippers.go:423] Response Headers:
I0623 13:10:52.044744   35338 round_trippers.go:426]     Content-Type: application/json
I0623 13:10:52.044751   35338 round_trippers.go:426]     Content-Length: 3396
I0623 13:10:52.044757   35338 round_trippers.go:426]     Date: Fri, 23 Jun 2017 18:10:52 GMT
I0623 13:10:52.045303   35338 request.go:991] Response Body: {"kind":"Pod","apiVersion":"v1","metadata":{"name":"nginx-ingress-controller-294216488-ncfq9","generateName":"nginx-ingress-controller-294216488-","namespace":"kube-system","selfLink":"/api/v1/namespaces/kube-system/pods/nginx-ingress-controller-294216488-ncfq9","uid":"49f7b2f2-5833-11e7-b991-fa163e0178f4","resourceVersion":"1584","creationTimestamp":"2017-06-23T16:44:56Z","labels":{"k8s-app":"nginx-ingress-controller","pod-template-hash":"294216488"},"annotations":{"kubernetes.io/created-by":"{\"kind\":\"SerializedReference\",\"apiVersion\":\"v1\",\"reference\":{\"kind\":\"ReplicaSet\",\"namespace\":\"kube-system\",\"name\":\"nginx-ingress-controller-294216488\",\"uid\":\"49ed0172-5833-11e7-b991-fa163e0178f4\",\"apiVersion\":\"extensions\",\"resourceVersion\":\"1383\"}}\n"},"ownerReferences":[{"apiVersion":"extensions/v1beta1","kind":"ReplicaSet","name":"nginx-ingress-controller-294216488","uid":"49ed0172-5833-11e7-b991-fa163e0178f4","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"default-token-b6f5v","secret":{"secretName":"default-token-b6f5v","defaultMode":420}}],"containers":[{"name":"nginx-ingress-controller","image":"gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.3","args":["/nginx-ingress-controller","--default-backend-service=$(POD_NAMESPACE)/default-http-backend"],"ports":[{"hostPort":80,"containerPort":80,"protocol":"TCP"},{"hostPort":443,"containerPort":443,"protocol":"TCP"}],"env":[{"name":"POD_NAME","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}},{"name":"POD_NAMESPACE","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.namespace"}}}],"resources":{},"volumeMounts":[{"name":"default-token-b6f5v","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"livenessProbe":{"httpGet":{"path":"/healthz","port":10254,"scheme":"HTTP"},"initialDelaySeconds":10,"timeoutSeconds":1,"periodSeconds":10,"successThreshold":1,"failureThreshold":3},"readinessProbe":{"httpGet":{"path":"/healthz","port":10254,"scheme":"HTTP"},"timeoutSeconds":1,"periodSeconds":10,"successThreshold":1,"failureThreshold":3},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":60,"dnsPolicy":"ClusterFirst","serviceAccountName":"default","serviceAccount":"default","nodeName":"shanec-cluster-worker-1","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler"},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2017-06-23T16:44:56Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2017-06-23T16:45:26Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2017-06-23T16:44:56Z"}],"hostIP":"148.62.13.65","podIP":"148.62.13.65","startTime":"2017-06-23T16:44:56Z","containerStatuses":[{"name":"nginx-ingress-controller","state":{"running":{"startedAt":"2017-06-23T16:45:12Z"}},"lastState":{},"ready":true,"restartCount":0,"image":"gcr.io/google_containers/nginx-ingress-controller:0.9.0-beta.3","imageID":"docker-pullable://gcr.io/google_containers/nginx-ingress-controller@sha256:995427304f514ac1b70b2c74ee3c6d4d4ea687fb2dc63a1816be15e41cf0e063","containerID":"docker://2b93a3253696a1498dbe718b0eeb553fde2335f14a81e30837a6fe057d457264"}],"qosClass":"BestEffort"}}
I0623 13:10:52.047274   35338 round_trippers.go:398] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.6.6 (darwin/amd64) kubernetes/7fa1c17" https://shanec-cluster-k8s.pug.systems:443/api/v1/namespaces/kube-system/pods/nginx-ingress-controller-294216488-ncfq9/log?tailLines=3
I0623 13:10:52.082334   35338 round_trippers.go:417] GET https://shanec-cluster-k8s.pug.systems:443/api/v1/namespaces/kube-system/pods/nginx-ingress-controller-294216488-ncfq9/log?tailLines=3 200 OK in 35 milliseconds
I0623 13:10:52.082358   35338 round_trippers.go:423] Response Headers:
I0623 13:10:52.082364   35338 round_trippers.go:426]     Content-Type: text/plain
I0623 13:10:52.082368   35338 round_trippers.go:426]     Content-Length: 1057
I0623 13:10:52.082372   35338 round_trippers.go:426]     Date: Fri, 23 Jun 2017 18:10:52 GMT
::ffff:172.99.99.10 - [::ffff:172.99.99.10] - - [23/Jun/2017:18:10:37 +0000] "GET /logs/api/reporting/jobs/list_completed_since?since=2017-06-23T17:26:26.119Z HTTP/2.0" 200 262 "https://148.62.13.65/logs/app/monitoring" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Firefox/45.0" 66 0.014 [kube-system-logs-5601] 10.2.0.3:5601 37 0.014 200"
::ffff:172.99.99.10 - [::ffff:172.99.99.10] - - [23/Jun/2017:18:10:46 +0000] "POST /logs/api/monitoring/v1/clusters HTTP/2.0" 200 973 "https://148.62.13.65/logs/app/monitoring" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Firefox/45.0" 117 0.177 [kube-system-logs-5601] 10.2.0.3:5601 750 0.177 200"
::ffff:172.99.99.10 - [::ffff:172.99.99.10] - - [23/Jun/2017:18:10:47 +0000] "GET /logs/api/reporting/jobs/list_completed_since?since=2017-06-23T17:26:26.119Z HTTP/2.0" 200 262 "https://148.62.13.65/logs/app/monitoring" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0) Gecko/20100101 Firefox/45.0" 66 0.013 [kube-system-logs-5601] 10.2.0.3:5601 37 0.013 200"
  • If the pod has been restarted previously, check the previous container’s logs with the --previous flag.
$ kubectl describe po/etcd-operator-4083686351-rh6x7 -n kube-system | grep -A6 'State'
    State:    Running
      Started:    Fri, 23 Jun 2017 11:45:06 -0500
    Last State:   Terminated
      Reason:   Error
      Exit Code:  1
      Started:    Mon, 01 Jan 0001 00:00:00 +0000
      Finished:   Fri, 23 Jun 2017 11:45:05 -0500
    Ready:    True
    Restart Count:  1
$ kubectl logs --previous po/etcd-operator-4083686351-rh6x7 -n kube-system
time="2017-06-23T16:42:48Z" level=info msg="etcd-operator Version: 0.3.0"
time="2017-06-23T16:42:48Z" level=info msg="Git SHA: d976dc4"
time="2017-06-23T16:42:48Z" level=info msg="Go Version: go1.8.1"
time="2017-06-23T16:42:48Z" level=info msg="Go OS/Arch: linux/amd64"
time="2017-06-23T16:43:12Z" level=info msg="starts running from watch version: 0" pkg=controller
time="2017-06-23T16:43:12Z" level=info msg="start watching at 0" pkg=controller
time="2017-06-23T16:43:19Z" level=info msg="creating cluster with Spec (spec.ClusterSpec{Size:1, Version:\"3.1.6\", Paused:false, Pod:(*spec.PodPolicy)(0xc420240e40), Backup:(*spec.BackupPolicy)(nil), Restore:(*spec.RestorePolicy)(nil), SelfHosted:(*spec.SelfHostedPolicy)(0xc420116f70), TLS:(*spec.TLSPolicy)(nil)}), Status (spec.ClusterStatus{Phase:\"Creating\", Reason:\"\", ControlPaused:false, Conditions:[]spec.ClusterCondition(nil), Size:0, Members:spec.MembersStatus{Ready:[]string(nil), Unready:[]string(nil)}, CurrentVersion:\"\", TargetVersion:\"\", BackupServiceStatus:(*spec.BackupServiceStatus)(nil)})" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:43:19Z" level=info msg="migrating boot member (http://10.3.0.200:12379)" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:43:28Z" level=info msg="self-hosted cluster created with boot member (http://10.3.0.200:12379)" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:43:28Z" level=info msg="wait 1m0s before removing the boot member" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:43:28Z" level=info msg="start running..." cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:43:36Z" level=error msg="failed to update members: skipping update members for self hosted cluster: waiting for the boot member (boot-etcd) to be removed..." cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:43:44Z" level=error msg="failed to update members: skipping update members for self hosted cluster: waiting for the boot member (boot-etcd) to be removed..." cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:43:52Z" level=error msg="failed to update members: skipping update members for self hosted cluster: waiting for the boot member (boot-etcd) to be removed..." cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:00Z" level=error msg="failed to update members: skipping update members for self hosted cluster: waiting for the boot member (boot-etcd) to be removed..." cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:08Z" level=error msg="failed to update members: skipping update members for self hosted cluster: waiting for the boot member (boot-etcd) to be removed..." cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:16Z" level=error msg="failed to update members: skipping update members for self hosted cluster: waiting for the boot member (boot-etcd) to be removed..." cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:24Z" level=error msg="failed to update members: skipping update members for self hosted cluster: waiting for the boot member (boot-etcd) to be removed..." cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:35Z" level=info msg="Start reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:35Z" level=info msg="Finish reconciling" cluster-name=kube-etcd pkg=cluster
E0623 16:44:36.200993       1 election.go:259] Failed to update lock: etcdserver: request timed out, possibly due to previous leader failure
time="2017-06-23T16:44:43Z" level=info msg="Start reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:43Z" level=info msg="Finish reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:51Z" level=info msg="Start reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:51Z" level=info msg="Finish reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:59Z" level=info msg="Start reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:44:59Z" level=info msg="Finish reconciling" cluster-name=kube-etcd pkg=cluster
time="2017-06-23T16:45:05Z" level=error msg="received invalid event from API server: fail to decode raw event from apiserver (unexpected EOF)" pkg=controller
time="2017-06-23T16:45:05Z" level=fatal msg="controller Run() ended with failure: fail to decode raw event from apiserver (unexpected EOF)"

Attach to a pod

You can attach to a pod for troubleshooting purposes. However, do not make changes or fix issues through attaching. If a pod is not working as intended, investigate the image, YAML configuration file, or system needs.

  1. Run /bin/bash from the NGINX pod:
$ kubectl -n rackspace-system exec -it nginx-ingress-controller-294216488-ncfq9 -- /bin/bash
  1. If bash fails, try sh:
$ kubectl -n rackspace-system exec -it nginx-ingress-controller-294216488-ncfq9 -- sh`

Now you can perform a few simple operations wit. the pod.

  1. View the list of directories and files:
root@shanec-cluster-worker-1:/# ls -la /
total 26936
drwxr-xr-x.   1 root root     4096 Jun 23 16:45 .
drwxr-xr-x.   1 root root     4096 Jun 23 16:45 ..
-rwxr-xr-x.   1 root root        0 Jun 23 16:45 .dockerenv
-rw-r-----.   2 root root     1194 Feb 22 17:39 Dockerfile
drwxr-xr-x.   2 root root     4096 Jun 23 16:45 bin
drwxr-xr-x.   2 root root     4096 Apr 12  2016 boot
drwxr-xr-x.   5 root root      380 Jun 23 16:45 dev
drwxr-xr-x.   1 root root     4096 Jun 23 16:45 etc
drwxr-xr-x.   2 root root     4096 Apr 12  2016 home
drwxr-x---.   1 root root     4096 Jun 23 16:45 ingress-controller
drwxr-xr-x.   5 root root     4096 Jun 23 16:45 lib
drwxr-xr-x.   2 root root     4096 Jun 23 16:45 lib64
drwxr-xr-x.   2 root root     4096 Jan 19 16:31 media
drwxr-xr-x.   2 root root     4096 Jan 19 16:31 mnt
-rwxr-x---.   2 root root 27410080 Mar 14 21:46 nginx-ingress-controller
drwxr-xr-x.   2 root root     4096 Jan 19 16:31 opt
dr-xr-xr-x. 126 root root        0 Jun 23 16:45 proc
drwx------.   1 root root     4096 Jun 24 23:30 root
drwxr-xr-x.   1 root root     4096 Jun 23 16:45 run
drwxr-xr-x.   2 root root     4096 Mar 14 21:46 sbin
drwxr-xr-x.   2 root root     4096 Jan 19 16:31 srv
dr-xr-xr-x.  13 root root        0 Jun 23 16:39 sys
drwxrwxrwt.   1 root root     4096 Jun 24 23:30 tmp
drwxr-xr-x.  10 root root     4096 Jun 23 16:45 usr
drwxr-xr-x.   1 root root     4096 Jun 23 16:45 var
  1. Display the information about disk space:
root@shanec-cluster-worker-1:/# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          95G  3.2G   88G   4% /
tmpfs           2.0G     0  2.0G   0% /dev
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/vda9        95G  3.2G   88G   4% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           2.0G   12K  2.0G   1% /run/secrets/kubernetes.io/serviceaccount
  1. Display the information about the Linux distribution:
root@shanec-cluster-worker-1:/# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.2 LTS"
Run single commands

You can run single commands to perform operations with pods:

$ kubectl -n kube-system exec nginx-ingress-controller-294216488-ncfq9 ls /var/log/nginx
access.log
error.log

Check the running configuration

When troubleshooting a Kubernetes cluster, you might want to check the YAML configuration file that the deployment uses.

To check the deployment YAML file, run the following command:

$ kubectl get deploy/kibana -o yaml -n rackspace-system > deployment-kibana.yaml

System response:

$ cat deployment-kibana.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"Deployment","metadata":{"annotations":{},"labels":{"k8s-app":"kibana"},"name":"kibana","namespace":"kube-system"},"spec":{"replicas":1,"template":{"metadata":{"labels":{"k8s-app":"kibana"}},"spec":{"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"k8s-app","operator":"In","values":["kibana"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"containers":[{"env":[{"name":"ELASTICSEARCH_URL","value":"http://elasticsearch-logging:9200"},{"name":"SERVER_BASEPATH","value":"/logs"}],"image":"docker.elastic.co/kibana/kibana:5.3.0","name":"kibana","ports":[{"containerPort":5601,"name":"ui","protocol":"TCP"}],"readinessProbe":{"httpGet":{"path":"/api/status","port":5601},"initialDelaySeconds":90,"periodSeconds":60},"resources":{"requests":{"cpu":"100m"}}}]}}}}
  creationTimestamp: 2017-06-23T16:44:37Z
  generation: 1
  labels:
    k8s-app: kibana
  name: kibana
  namespace: rackspace-system
  resourceVersion: "2595"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/kibana
  uid: 3ec4a8d0-5833-11e7-acfc-fa163e1f6ecf
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: kibana
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kibana
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - kibana
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: ELASTICSEARCH_URL
          value: http://elasticsearch-logging:9200
        - name: SERVER_BASEPATH
          value: /logs
        image: docker.elastic.co/kibana/kibana:5.3.0
        imagePullPolicy: IfNotPresent
        name: kibana
        ports:
        - containerPort: 5601
          name: ui
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/status
            port: 5601
            scheme: HTTP
          initialDelaySeconds: 90
          periodSeconds: 60
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2017-06-23T16:44:56Z
    lastUpdateTime: 2017-06-23T16:44:56Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Namespaces

Kubernetes uses namespaces to isolate Kubernetes resources from each other and manage access control. Namespaces are virtual clusters that provide scope for names and cluster resource management. When you run a command, you need to specify in which namespace a resource is located.

By default, Rackspace KaaS has the following Kubernetes namespaces:

  • default
  • kube-public
  • kube-system
  • monitoring
  • rackspace-system
  • tectonic-system

You can use the following commands to get information about Kubernetes namespaces:

  1. Get the list of namespaces:
 $ kubectl get namespace  
NAME               STATUS    AGE  
default            Active    15h  
kube-public        Active    15h  
kube-system        Active    15h  
monitoring         Active    15h  
rackspace-system   Active    15h  
tectonic-system    Active    15h
  1. Get the list of pods:
$ kubectl get pods  
No resources found.
  1. Get the list of pods for the kube-system namespace:
$ kubectl get pods -n kube-system  
NAME                                             READY     STATUS    RESTARTS   AGE  
default-http-backend-2198840601-phqnn            1/1       Running   0          1d  
elasticsearch-3003189550-91mvh                   1/1       Running   0          1d  
elasticsearch-3003189550-l3gds                   1/1       Running   0          1d  
elasticsearch-3003189550-pz8zn                   1/1       Running   0          1d  
etcd-operator-4083686351-rh6x7                   1/1       Running   1          1d  
...

Failure domains

The following list describes significant failure domains and issues in a Kubernetes cluster deployment:

  • etcd
    • Susceptible to network latency
    • Maintain 50%+1 availabity (will go read-only if fall below threshold)
  • Controller manager and scheduler
    • Controller manager fails: cloud provider integrations, such as neutron LBaaS stop working.
    • Controller manager and scheduler fail: Deployments and so on cannot scale. Failed node workloads do not reschedule.

Troubleshooting unresponsive Kubernetes services

This section describes troubleshooting tasks for Rackapace KaaS managed services.

kube-scheduler

Kubernetes scheduler plays an important role in ensuring Kubernetes resource availability and performance. Verifying that Kubernetes scheduler is operational is one of the essential steps of cluster troubleshooting.

No scheduler pods available

When a Kubernetes service is unavailable, you can use kube-scheduler to troubleshoot the problem. To simulate a scenario where no scheduler pods are available, scale the deployment to zero replicas so that Kubernetes cannot schedule or reschedule any additional pods.

Note: You can use this method with other services, such as kube-controller-manager.

To troubleshoot an unavailable Kubernetes service, complete the following steps:

  1. List all kube-scheduler resources in the kube-system namespace:
 $ kubectl get all -l k8s-app=kube-scheduler -n kube-system  
   NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
   deploy/kube-scheduler   3         3         3            3           2d

   NAME                           DESIRED   CURRENT   READY     AGE  
   rs/kube-scheduler-774c4578b7   3         3         3         2d

   NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
   deploy/kube-scheduler   3         3         3            3           2d

   NAME                           DESIRED   CURRENT   READY     AGE  
   rs/kube-scheduler-774c4578b7   3         3         3         2d

   NAME                                 READY     STATUS    RESTARTS   AGE  
   po/kube-scheduler-774c4578b7-jrd9s   1/1       Running   0          2d  
   po/kube-scheduler-774c4578b7-jtqcl   1/1       Running   0          2d  
   po/kube-scheduler-774c4578b7-nprqb   1/1       Running   0          2d

   NAME                                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE  
   svc/kube-scheduler-prometheus-discovery   ClusterIP   None         <none>        10251/TCP   2d
  1. Scale kube-scheduler to zero replicas:
$ kubectl scale deploy kube-scheduler --replicas 0 -n kube-system  
deployment "kube-scheduler" scaled
  1. List all resources for kube-scheduler:
$ kubectl get all -l k8s-app=kube-scheduler -n kube-system  
NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
deploy/kube-scheduler   0         0         0            0           2d

NAME                           DESIRED   CURRENT   READY     AGE  
rs/kube-scheduler-774c4578b7   0         0         0         2d

NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
deploy/kube-scheduler   0         0         0            0           2d

NAME                           DESIRED   CURRENT   READY     AGE  
rs/kube-scheduler-774c4578b7   0         0         0         2d

NAME                                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE  
svc/kube-scheduler-prometheus-discovery   ClusterIP   None         <none>        10251/TCP   2d

Kubernetes scheduler is now unavailable. To fix this you can act as a scheduler.

  1. Save the kube-scheduler configuration in a *.yaml file:
kubectl get deploy kube-scheduler -n kube-system -o yaml > scheduler.yaml
  1. View the scheduler.yaml file:
$ cat scheduler.yaml

  apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
    creationTimestamp: 2017-10-04T18:52:20Z
    generation: 2
    labels:
      k8s-app: kube-scheduler
      tectonic-operators.coreos.com/managed-by: kube-version-operator
      tier: control-plane
    name: kube-scheduler
    namespace: kube-system
    resourceVersion: "413557"
    selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/kube-scheduler
    uid: 26cb400d-a935-11e7-9e40-fa163ece5424
  spec:
    replicas: 0
    selector:
      matchLabels:
        k8s-app: kube-scheduler
        tier: control-plane
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      creationTimestamp: null
      labels:
        k8s-app: kube-scheduler
        pod-anti-affinity: kube-scheduler-1.7.5-tectonic.1
        tectonic-operators.coreos.com/managed-by: kube-version-operator
        tier: control-plane
  spec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              pod-anti-affinity: kube-scheduler-1.7.5-tectonic.1
          namespaces:
          - kube-system
          topologyKey: kubernetes.io/hostname
    containers:
    - command:
      - ./hyperkube
      - scheduler
      - --leader-elect=true
      image: quay.io/coreos/hyperkube:v1.8.0_coreos.0
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /healthz
          port: 10251
          scheme: HTTP
        initialDelaySeconds: 15
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 15
      name: kube-scheduler
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
    dnsPolicy: ClusterFirst
    nodeSelector:
      node-role.kubernetes.io/master: ""
    restartPolicy: Always
    schedulerName: default-scheduler
    securityContext:
      runAsNonRoot: true
      runAsUser: 65534
    terminationGracePeriodSeconds: 30
    tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
  status:
    conditions:
    - lastTransitionTime: 2017-10-04T18:53:34Z
      lastUpdateTime: 2017-10-04T18:53:34Z
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
  observedGeneration: 2
  1. Change the kind of resource from Deployment to Pod, add nodeName which must be one of the Kubernetes master nodes, and remove all the unnecessary parameters:
$ cat scheduler.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: kube-scheduler
    tectonic-operators.coreos.com/managed-by: kube-version-operator
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
  uid: 26cb400d-a935-11e7-9e40-fa163ece5424
spec:
  nodeName: test-cluster-master-0
  containers:
  - command:
    - ./hyperkube
    - scheduler
    - --leader-elect=true
    image: quay.io/coreos/hyperkube:v1.8.0_coreos.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 10251
        scheme: HTTP
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
    name: kube-scheduler
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  dnsPolicy: ClusterFirst
  nodeSelector:
    node-role.kubernetes.io/master: ""
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  1. Create the scheduler pod from the .yaml file:
$ kubectl create -f scheduler.yaml  
pod "kube-scheduler" created
  1. Verify that the pod is running:
$ kubectl get all -l k8s-app=kube-scheduler -n kube-system  
NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
deploy/kube-scheduler   0         0         0            0           2d

NAME                           DESIRED   CURRENT   READY     AGE  
rs/kube-scheduler-774c4578b7   0         0         0         2d

NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
deploy/kube-scheduler   0         0         0            0           2d

NAME                           DESIRED   CURRENT   READY     AGE  
rs/kube-scheduler-774c4578b7   0         0         0         2d

NAME                READY     STATUS    RESTARTS   AGE  
po/kube-scheduler   1/1       Running   0          18s

NAME                                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE  
svc/kube-scheduler-prometheus-discovery   ClusterIP   None         <none>        10251/TCP   2d
  1. Scale the kube-scheduler pod back to three replicas:
$ kubectl scale deploy kube-scheduler --replicas 3 -n kube-system  
deployment "kube-scheduler" scaled

$ kubectl get all -l k8s-app=kube-scheduler -n kube-system  
NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
deploy/kube-scheduler   3         3         3            3           2d

NAME                           DESIRED   CURRENT   READY     AGE  
rs/kube-scheduler-774c4578b7   3         3         3         2d

NAME                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE  
deploy/kube-scheduler   3         3         3            3           2d

NAME                           DESIRED   CURRENT   READY     AGE  
rs/kube-scheduler-774c4578b7   3         3         3         2d

NAME                                 READY     STATUS    RESTARTS   AGE  
po/kube-scheduler                    1/1       Running   0          3m  
po/kube-scheduler-774c4578b7-kdjq7   1/1       Running   0          20s  
po/kube-scheduler-774c4578b7-s5khx   1/1       Running   0          20s  
po/kube-scheduler-774c4578b7-zqmlc   1/1       Running   0          20s

NAME                                      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)     AGE  
svc/kube-scheduler-prometheus-discovery   ClusterIP   None         <none>        10251/TCP   2d

10.
  1. Delete the standalone pod.
 $ kubectl delete -f scheduler.yaml  
   pod "kube-scheduler" deleted
Persistance volume claims

Re-attach a cinder volume to a Kubernetes worker node

When Kubernetes reschedules a pod to a different worker node, a cinder volume might fail to properly detach and re-attach to the new Kubernetes worker node. In this case, you can detach and re-attach the cinder volume manually to the correct Kubernetes worker node. However, after detaching the cinder volume from the old worker node and attaching it to the proper node with the pod on it, the pod was still unable to mount the required volume. The worker node logged the following information:

Apr 19 19:29:35 kubernetes-worker-2 kubelet-wrapper[3486]: E0419 19:29:35.648597    3486 cinder_util.go:231] error running udevadm trigger executable file not found in $PATH
Apr 19 19:29:35 kubernetes-worker-2 kubelet-wrapper[3486]: W0419 19:29:35.662610    3486 openstack_volumes.go:530] Failed to find device for the volumeID: "f30a02e0-3f1d-4b72-8780-89fcb7f607e2"
Apr 19 19:29:35 kubernetes-worker-2 kubelet-wrapper[3486]: E0419 19:29:35.662656    3486 attacher.go:257] Error: could not find attached Cinder disk "f30a02e0-3f1d-4b72-8780-89fcb7f607e2" (path: ""): <nil>

The cinder volume was properly attached.

kubernetes-worker-2 ~ # lsblk

...

`-vda9  253:9    0 37.7G  0 part  /
vdb     253:16   0    1G  0 disk  /var/lib/rkt/pods/run/e38e20c6-5079-4963-a903-16eac5f78cc4/stage1/rootfs/opt/stage2/hyperkube/rootfs/var/lib/kubelet/pods/ccc7835f-3c5a-11e8-aa40-fa163e75f19c/volumes/kubernetes.io~cinder/pvc-ed6ce616-
vdc     253:32   0    5G  0 disk  /var/lib/rkt/pods/run/e38e20c6-5079-4963-a903-16eac5f78cc4/stage1/rootfs/opt/stage2/hyperkube/rootfs/var/lib/kubelet/pods/bbf8206d-3d26-11e8-a1f7-fa163e9d3696/volumes/kubernetes.io~cinder/pvc-f784e495-
vdd     253:48   0   10G  0 disk

However, it did not appear in /dev/disk/by-id.

kubernetes-worker-2 ~ # ls -la /dev/disk/by-id/
total 0
drwxr-xr-x. 2 root root 140 Apr 11 01:27 .
drwxr-xr-x. 9 root root 180 Apr 11 01:23 ..
lrwxrwxrwx. 1 root root  10 Apr 11 01:24 dm-name-usr -> ../../dm-0
lrwxrwxrwx. 1 root root  10 Apr 11 01:24 dm-uuid-CRYPT-VERITY-81303a5145884861ba3eed4159b13a6e-usr -> ../../dm-0
lrwxrwxrwx. 1 root root  10 Apr 11 01:24 raid-usr -> ../../dm-0
lrwxrwxrwx. 1 root root   9 Apr 11 01:26 virtio-00b864b6-4e42-4447-8 -> ../../vdb
lrwxrwxrwx. 1 root root   9 Apr 11 01:27 virtio-7dbdb566-7da1-4ffd-a -> ../../vdc

It appeared in /dev/disk/by-path/.

kubernetes-worker-2 ~ # ls -la /dev/disk/by-path/
total 0

...

lrwxrwxrwx. 1 root root   9 Apr 11 01:26 virtio-pci-0000:00:0e.0 -> ../../vdb
lrwxrwxrwx. 1 root root   9 Apr 11 01:27 virtio-pci-0000:00:0f.0 -> ../../vdc
lrwxrwxrwx. 1 root root   9 Apr 19 18:19 virtio-pci-0000:00:10.0 -> ../../vdd

A search returned the following from the worker node.

kubernetes-worker-2 ~ # udevadm trigger

After that, vdd recovered and the pod was able to attach the volume and start properly.

kubernetes-worker-2 ~ # ls -la /dev/disk/by-id/
total 0
drwxr-xr-x. 2 root root 160 Apr 19 19:47 .
drwxr-xr-x. 9 root root 180 Apr 11 01:23 ..
lrwxrwxrwx. 1 root root  10 Apr 19 19:47 dm-name-usr -> ../../dm-0
lrwxrwxrwx. 1 root root  10 Apr 19 19:47 dm-uuid-CRYPT-VERITY-81303a5145884861ba3eed4159b13a6e-usr -> ../../dm-0
lrwxrwxrwx. 1 root root  10 Apr 19 19:47 raid-usr -> ../../dm-0
lrwxrwxrwx. 1 root root   9 Apr 19 19:47 virtio-00b864b6-4e42-4447-8 -> ../../vdb
lrwxrwxrwx. 1 root root   9 Apr 19 19:47 virtio-7dbdb566-7da1-4ffd-a -> ../../vdc
lrwxrwxrwx. 1 root root   9 Apr 19 19:47 virtio-f30a02e0-3f1d-4b72-8 -> ../../vdd

Troubleshooting etcd

etcd is a highly-available distributed key-value store for Kubernetes and one of the most critical components of Rackspace KaaS that requires maintenance. The process of etcd backup, restoration, and scaling is currently being developed.

For more information about etcd, see:

Check the etcd cluster health manually

  1. Connect to the etcd node using SSH.
  2. Check the cluster health information using etcdctl:
etcdctl --ca-file=/opt/tectonic/tls/etcd-client-ca.crt  --cert-file=/opt/tectonic/tls/etcd-client.crt --key-file=/opt/tectonic/tls/etcd-client.key --endpoints=https://etcd-0.<domain-name>:2379,https://etcd-1.<domain-name>:2379,https://etcd-2.<domain-name>:2379 cluster-health

View the etcd data directory

The etcd data directory stores the information about etcd configuration in a write-ahead log (WAL). The data directory has the following subdirectories:

  • snap stores snapshots of the log files.
  • wal stores the write-ahead log files.

To view the contents of the snap and wal subdirectories for a selected pod, run the following command:

$ kubectl -n kube-system exec -it kube-etcd-0000 -- ls -lah /var/etcd/kube-system-kube-etcd-0000/member/{snap,wal}

System response:

/var/etcd/kube-system-kube-etcd-0000/member/snap:  
total 5136  
drwx------    2 root     root        4.0K Jun 25 15:35 .  
drwx------    4 root     root        4.0K Jun 25 15:35 ..  
-rw-------    1 root     root       16.0M Jun 25 15:57 db

/var/etcd/kube-system-kube-etcd-0000/member/wal:  
total 125024  
drwx------    2 root     root        4.0K Jun 25 15:35 .  
drwx------    4 root     root        4.0K Jun 25 15:35 ..  
-rw-------    1 root     root       61.0M Jun 25 15:35 0.tmp  
-rw-------    1 root     root       61.0M Jun 25 15:57 0000000000000000-0000000000000000.wal

Troubleshooting octavia

OpenStack octavia is a load balancer that ensures even workload distribution among Kubernetes worker nodes. This section describes common issues with octavia that runs on top of Rackspace Private Cloud Powered by OpenStack (RPCO) for the Rackspace Kubernetes-as-a-Service (KaaS) solution.

Identify the load balancer that backs a public IP address

Typically, a Rackspace KaaS deployment manages load balancer (LB) instances for the Kubernetes API, the Ingress Controller, and the Docker registry. Each instance has a DNS name associated with it that has the following naming conventions:

  • The Kubernetes API instance - *``k8s.my.domain
  • The Ingress Controller instance - kibana.my.domain
  • The Docker registry instance - registry.my.domain

To identify which load balancer is associated with a public floating IP address (FIP) for the Kubernetes cluster, perform the following steps:

  1. Find the public FIP address based on the DNS name.
  2. Find the fixed IP address associated with the public FIP address in the output of the following command:
openstack floating ip list --project <cluster-project-id>
  1. View the list of the deployed load balancers:
openstack loadbalancer list --project <cluster-project-id>

The fixed IP address from the previous step matches the vip_address of the load balancer that backs the public IP.

Example:

$ ping kibana.my-subdomain.mk8s.systems
  PING kibana.my-subdomain.mk8s.systems (172.99.77.50)

$ openstack floating ip list --project 2638a6bef56e4b63a45b1b6a837e5c0e

+--------------------------------------+---------------------+------------------+--------------------------------------+--------------------------------------+----------------------------------+
| ID                                   | Floating IP Address | Fixed IP Address | Port                                 | Floating Network                     | Project                          |
+--------------------------------------+---------------------+------------------+--------------------------------------+--------------------------------------+----------------------------------+
| 1546003c-d081-4ca0-a015-3f6654966341 | 172.99.77.25        | 10.0.0.10        | c7f160e7-cd46-4ef7-ae1d-b94bb429ce03 | 2c7b3798-d48d-4b35-b915-51d28f5ffeb8 | 2638a6bef56e4b63a45b1b6a837e5c0e |
| 476ad4ff-6798-4d79-b7c4-22b10bdec987 | 172.99.77.50        | 10.0.0.9         | 9bd6bf8e-52db-437b-81b3-acf374e45f46 | 2c7b3798-d48d-4b35-b915-51d28f5ffeb8 | 2638a6bef56e4b63a45b1b6a837e5c0e |
| 4ef044fb-56db-46b9-8c0f-13b22a015de4 | 172.99.77.185       | 10.242.0.33      | 7530546e-aec9-42e2-a11e-dc09fbba5538 | 2c7b3798-d48d-4b35-b915-51d28f5ffeb8 | 2638a6bef56e4b63a45b1b6a837e5c0e |
| 61644afa-4283-4460-be7c-0025a03f0483 | 172.99.77.92        | 10.0.0.35        | 9464e8f0-56aa-4e3a-979e-a61190bb05b3 | 2c7b3798-d48d-4b35-b915-51d28f5ffeb8 | 2638a6bef56e4b63a45b1b6a837e5c0e |
| 9629a3eb-503e-4792-a4ba-f849d43d77d7 | 172.99.77.127       | 10.0.0.8         | fcf93e0b-1093-4b19-a021-10ba4668d094 | 2c7b3798-d48d-4b35-b915-51d28f5ffeb8 | 2638a6bef56e4b63a45b1b6a837e5c0e |
| c3167915-cf05-4ddd-ad78-ed60ee495c0c | 172.99.77.157       | 10.242.0.21      | eb3181d4-821b-49ca-8583-02431f7d22db | 2c7b3798-d48d-4b35-b915-51d28f5ffeb8 | 2638a6bef56e4b63a45b1b6a837e5c0e |
| d4c1b009-7b1c-459c-abd8-57effc1f28e9 | 172.99.77.187       | 10.0.0.22        | f1801ceb-c5d9-4e8a-bbcf-e54d31f25e6f | 2c7b3798-d48d-4b35-b915-51d28f5ffeb8 | 2638a6bef56e4b63a45b1b6a837e5c0e |
| dc92e666-2892-4072-ab47-2bfdef886b3c | 172.99.77.201       | 10.0.0.19        | f1077048-de22-4890-ba85-21a6eb973a1c | 2c7b3798-d48d-4b35-b915-51d28f5ffeb8 | 2638a6bef56e4b63a45b1b6a837e5c0e |
+--------------------------------------+---------------------+------------------+--------------------------------------+--------------------------------------+----------------------------------+

$ openstack loadbalancer list

+--------------------------------------+----------------------------------+----------------------------------+-------------+---------------------+----------+
| id                                   | name                             | project_id                       | vip_address | provisioning_status | provider |
+--------------------------------------+----------------------------------+----------------------------------+-------------+---------------------+----------+
| b413a684-e166-41d7-814e-4ccf3b9a8945 | a51c6689f3f3a11e88bd2fa163e8c31c | 2638a6bef56e4b63a45b1b6a837e5c0e | 10.0.0.23   | ACTIVE              | octavia  |
| 46ac026b-1138-44b6-9a23-751fcbd609bd | a791cadb1424611e8bda8fa163e8c31c | 2638a6bef56e4b63a45b1b6a837e5c0e | 10.0.0.35   | ACTIVE              | octavia  |
| 2ae6615f-67ba-4fd7-a06e-c7959fe7b748 | k8s-terraform_master             | 2638a6bef56e4b63a45b1b6a837e5c0e | 10.0.0.22   | ACTIVE              | octavia  |
| 6a892fe7-9fb3-4601-bb9c-0b50dc363a18 | a595a2faa428f11e89e3cfa163eba020 | 2638a6bef56e4b63a45b1b6a837e5c0e | 10.0.0.9    | ACTIVE              | octavia  |
+--------------------------------------+----------------------------------+----------------------------------+-------------+---------------------+----------+

In the example above, the load balancer ID is 6a892fe7-9fb3-4601-bb9c-0b50dc363a18.

Check the load balancer status

Use the following command to check the operating health of a load balancer:

openstack loadbalancer show <load-balancer-id>

Octavia load balancers have the following status fields that indicate the status of the deployment:

  • provisioning_status

    The provisioning_status field indicates the status of the most recent action taken against the load balancer deployment. This includes such actions as deployment of a new LB instance or automatic replacement of an LB worker (amphora).

    If the provisioning_status of the load balancer displays ERROR, an underlying issue might be present that is preventing octavia from updating the load balancer deployment. Check the octavia logs on the OpenStack infrastructure nodes of the underlying cloud to determine the issue. After you resolve the issue, replace the load balancer as described in Replacing a Load Balancer.

    Note

    The ERROR status does not always mean that the load balancer cannot serve network traffic. It might indicate a problem with a change in the deployment configuration, not an operational malfunction. However, it implies a problem with the self-healing capabilities of the load balancer.

  • operating_status

    The operating_status field indicates the observed status of the deployment through the health monitor. If the operating_status has any status other than ONLINE, this might indicate one of the following issues:

    • An issue with a service behind the load balancer or the service’s health monitor. Verify that the servers and services behind the load balancer are up and can respond to the network traffic.
    • An issue with the load balancer that requires replacement of the load balancer instance. Typically, if provisioning_status of the load balancer is different from ERROR, the octavia health monitor resolves this issue automatically by failing over and replacing the amphora worker that is experiencing the issue. If the issue persists, and the octavia deployment cannot process the traffic, you might need to replace the load balancer instance as described in Replacing a Load Balancer.

Cannot delete a load balancer deployment

A load balancer deployment might change its status from PENDING_DELETE to ERROR during the deletion and fail to delete. Typically, this happens when OpenStack fails to delete one or more neutron ports. In some cases, re-running the delete command is enough to fix the issue, while in other cases, you might need to find the associated network, manually delete it, and then delete the octavia balancer using the openstack loadbalancer delete command.

To identify and correct this issue, perform the following steps:

  1. Re-run the load balancer deletion command:
openstack loadbalancer delete --cascade <load balancer id>

In some cases, the load balancer deletion opera. ion fails because of slow port deletion.

  1. Check the octavia_worker logs on each OpenStack infrastructure node to find which worker instance processed the deletion of the load balancer.
  2. Log in to the OpenStack infrastructure node.
  3. Run the following command:
root@543230-infra01:/opt/rpc-openstack/openstack-ansible# ansible os-infra_hosts -m shell -a "grep '52034239-a49d-41a9-a58d-7b2b880b98f7' /openstack/og/*octavia*/octavia/octavia-worker.log

System response:

...
543230-infra02 | OK | rc=0 >>
2018-04-16 14:18:35.580 14274 INFO octavia.controller.queue.endpoint [-] Deleting load balancer '52034239-a49d-41a9-a58d-7b2b880b98f7'...
2018-04-16 14:18:39.882 14274 INFO octavia.network.drivers.neutron.allowed_address_pairs [-] Removing security group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab from port c6dbdd75-cce1-4101-8235-3155b08df809

The output above indicates that the octavia worker on the infra02 node successfully completed the deletion operation.

  1. On the infra02 host, search for the messages following the deletion of the load balancing instance in /openstack/log/*octavia*/octavia/octavia-worker.log.

Example:

2018-04-16 14:18:35.580 14274 INFO octavia.controller.queue.endpoint [-] Deleting load balancer '52034239-a49d-41a9-a58d-7b2b880b98f7'...
2018-04-16 14:18:39.882 14274 INFO octavia.network.drivers.neutron.allowed_address_pairs [-] Removing security group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab from port c6dbdd75-cce1-4101-8235-3155b08df809
2018-04-16 14:18:40.653 14274 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 1 to remove security group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab failed.: Conflict: Security Group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab in use.
Neutron server returns request_ids: ['req-8b369249-733f-4728-9e3a-8474e8829562']
2018-04-16 14:18:41.765 14274 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 2 to remove security group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab failed.: Conflict: Security Group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab in use.
Neutron server returns request_ids: ['req-a3eceff7-a416-4a1f-aa0c-db1cb0daf9a9']
2018-04-16 14:18:42.833 14274 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 3 to remove security group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab failed.: Conflict: Security Group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab in use.
Neutron server returns request_ids: ['req-6ad52c9d-329c-4e88-aac0-0c64dfecb801']
2018-04-16 14:18:43.926 14274 WARNING octavia.network.drivers.neutron.allowed_address_pairs [-] Attempt 4 to remove security group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab failed.: Conflict: Security Group 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab in use.
Neutron server returns request_ids: ['req-8710b9cf-91c8-4310-9526-639a3998d93b']

In the example above, the OpenStack Networking service does not allow the security group for the octavia instance to be deleted because of the attached network port. This condition prevents the octavia instance from deleting.

To resolve this issue, find the neutron ports that are associated with the security group mentioned in the worker log and delete them. You can do this by checking the ports that are associated with both the network to which the load balancer is attached and the nova instances with the name that contains “vrrp” (the load balancer amphorae).

  1. Find the vip_network_id associated with the octavia instance:
 1. $ openstack loadbalancer show 52034239-a49d-41a9-a58d-7b2b880b98f7 |grep vip_network_id

System response:

| vip_network_id      | a8a50667-c783-4d0a-a399-422dc8c8ce2b |
  1. Find the port associated with the security group :
$ for i in `openstack port list --network a8a50667-c783-4d0a-a399-422dc8c8ce2b |grep vrrp |awk {'print $2'}`; do echo; echo "port: $i"; openstack port show $i |grep security_group_ids; done

System response:

port: 039fe0bd-cbda-4063-83e3-574ee8e67b39
| security_group_ids    | d2381d94-fbcf-4a22-89bf-8a74e8556e50                                     |

port: 4ba0ceb1-0e4e-437a-b9bd-91b33aaa545f
| security_group_ids    | 16907006-e689-4218-a0e8-360874b72932                                     |

port: 650f6f17-9008-482d-aa25-fa01bf650395
| security_group_ids    | 3e0d5d64-7281-4155-97c4-4cac1dc7dd4c                                     |

port: 8b085666-f128-497e-a83c-2fb193aed30f                      <<-
| security_group_ids    | 5af1ea07-141b-46c7-8b64-8a65e0f7a6ab   <<- Note matching security group  |

port: a6bbbddb-fb00-4bd8-bbad-69b99dac8551
| security_group_ids    | e129a55a-5000-4bce-8bce-b5f205267977                                     |

port: cd0f7f7d-112c-4c21-a970-c57195f1dcf6
| security_group_ids    | d2381d94-fbcf-4a22-89bf-8a74e8556e50                                     |