Implementation of Gesture Control using Hand & Finger Tracking with MediaPipe

Oh Chun How
May 23, 2021

Fig 1: Example of Hand Gestures

Pose estimation is a computer vision technique that uses a machine learning (ML) model to estimate the pose of a human from an image or video by predicting the spatial locations of key body joints. Similarly, a hand and finger tracking solution uses ML to infer the 3D spatial locations of 21 key points (landmarks) of a hand from a single image frame, which lets us detect the shape and motion of hands as well as hand gestures. Hand and finger tracking enables a wide range of applications, from sign language understanding to controlling human-machine interfaces with hand gestures. It is also vital in providing a way to communicate and interact in Virtual Reality or Augmented Reality environments.

MediaPipe Hands

Introduction

Tracking hands and fingers has long been challenging in computer vision because fingers and palms frequently occlude each other and hands lack high-contrast patterns. It often requires powerful hardware to run in real time, or special sensors such as a depth camera. In 2020, a team from Google released a novel solution, MediaPipe Hands, which offers several advantages over other approaches:

  • An efficient two-stage hand tracking pipeline that can track multiple hands in real-time on mobile devices.
  • A hand pose estimation model that is capable of predicting 3D hand pose with only RGB input.
  • Can be deployed on a variety of platforms, including Android, iOS, web and desktop PCs.
Fig 2: Tracked 3D hand landmarks are represented by dots in different shades, with the brighter ones denoting landmarks closer to the camera.

Architecture

This solution uses an ML pipeline consisting of two models working together: a palm detection model that operates on the full image and returns an oriented bounding box for each palm, and a hand landmark model that operates on the cropped region defined by that bounding box and returns high-fidelity 3D landmarks.
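
Conceptually, the pipeline only runs the (more expensive) palm detector when no hand is currently being tracked; otherwise the previous frame's landmarks supply the crop for the landmark model, as described in the next paragraphs. Below is a rough sketch of that control flow, not MediaPipe's actual code; palm_detector, landmark_model, crop_hand, box_from, MIN_TRACKING_CONFIDENCE and video_frames are placeholder names.

# Illustrative sketch of the two-stage pipeline (placeholder functions only)
tracked_box = None
for frame in video_frames:
    if tracked_box is None:
        tracked_box = palm_detector(frame)        # stage 1: runs on the full image
    if tracked_box is None:
        continue                                   # no hand found in this frame
    landmarks, presence = landmark_model(crop_hand(frame, tracked_box))  # stage 2
    if presence < MIN_TRACKING_CONFIDENCE:
        tracked_box = None                         # hand lost: re-run the detector next frame
    else:
        tracked_box = box_from(landmarks)          # landmarks define the next crop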

A single-shot palm detector model, the BlazePalm detector, is used to locate the hand initially and infer an oriented bounding box of the palm. Because it only needs to run on the initial frame (and whenever tracking is lost), it saves processing power, and its output crop feeds the subsequent hand landmark model for higher accuracy. The Google team used techniques such as an encoder-decoder feature extractor for larger scene-context awareness, and focal loss during training to cope with the large number of anchors required by the high variance in hand sizes. The resulting model achieves an average precision of 95.7% in palm detection.

Fig 3: Palm detector model architecture
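
For reference, the focal loss mentioned above (introduced by Lin et al. for dense object detection) down-weights the many easy background anchors so that hard examples dominate training; the exact weighting parameters used for BlazePalm are not stated here:

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)

where p_t is the predicted probability for the ground-truth class and \gamma > 0 reduces the loss contribution of well-classified examples.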

The hand landmark model then performs precise keypoint localization of the 21 3D landmark coordinates inside the cropped hand bounding box via regression. The model is robust even to partially visible hands and self-occlusions. If its confidence score falls below a certain threshold, the palm detection model is re-applied on the next frame. The model produces three outputs (see the snippet after Fig 4 for how they surface in the Python API):

  • 21 hand landmarks (x, y and relative depth)
  • A hand flag, probability of hand presence in the input image
  • A binary classification of left or right hand
Fig 4: Hand landmark model architecture
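
In the MediaPipe Python Solution API, the first and third outputs surface as results.multi_hand_landmarks and results.multi_handedness; the hand-presence flag is consumed internally through the detection/tracking confidence thresholds. A minimal sketch of reading them ('hand.jpg' is a placeholder path):

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
image = cv2.cvtColor(cv2.imread('hand.jpg'), cv2.COLOR_BGR2RGB)  # any RGB test image

with mp_hands.Hands(static_image_mode=True, max_num_hands=2) as hands:
    results = hands.process(image)
    if results.multi_hand_landmarks:
        for landmarks, handedness in zip(results.multi_hand_landmarks,
                                         results.multi_handedness):
            # 21 landmarks with normalized x, y and relative depth z
            wrist = landmarks.landmark[mp_hands.HandLandmark.WRIST]
            print(wrist.x, wrist.y, wrist.z)
            # Left/Right classification with its probability
            print(handedness.classification[0].label,
                  handedness.classification[0].score)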

The Google team trained this model on around 30K real-world images manually annotated with the 21 3D coordinates shown in Fig 5 below.

Fig 5: 21 Hand landmarks
Fig 6: MediaPipe Hands Process Flow

Implementation of Gesture Control

One of the biggest potential applications of a hand tracking solution is identifying hand gestures and using them as a control mechanism in a human-machine interface. As an example of such gesture control with MediaPipe Hands, we will create a simple application that controls the computer volume with hand gestures: moving the thumb and index finger further apart increases the volume, and moving them closer together decreases it. We will also use different finger positions as gestures to mute or unmute the computer.

Fig 7: Example of Hand Gestures

Step 1: Initialisation

Before we start, install MediaPipe, OpenCV and the Python Core Audio Windows Library (pycaw) if you do not already have them. pycaw is a package created by AndreMiras that enables volume control on Windows from Python.

pip install mediapipe opencv-python pycaw
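
As a quick standalone check of pycaw (these are the same calls used in the code below, just isolated; the 0.5 value is only an example), you can query the endpoint volume and set it directly:

from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))

print(volume.GetVolumeRange())                # volume range in dB (min, max, step)
print(volume.GetMasterVolumeLevelScalar())    # current volume as 0.0-1.0
volume.SetMasterVolumeLevelScalar(0.5, None)  # set volume to 50%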

We will start with the boilerplate code below, provided by MediaPipe and combined with the pycaw setup. The code captures the video stream from your webcam frame by frame, converts each image to RGB and runs the MediaPipe Hands model on it. If a hand is present, it estimates the 21 landmarks and draws them (using the mp_drawing utility) as red circles with green lines connecting them.

import cv2
import mediapipe as mp
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

# pycaw setup: get an interface to the system speaker volume
devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(
    IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))
volume.GetMute()
volume.GetMasterVolumeLevel()
volumeRange = volume.GetVolumeRange()

mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

cap = cv2.VideoCapture(0)
with mp_hands.Hands(
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        success, image = cap.read()
        if not success:
            print("Ignoring empty camera frame.")
            continue

        # Flip for a selfie view and convert BGR to RGB for MediaPipe
        image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)
        image.flags.writeable = False
        results = hands.process(image)

        # Draw the hand annotations on the image.
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(
                    image, hand_landmarks, mp_hands.HAND_CONNECTIONS)
        cv2.imshow('MediaPipe Hands', image)
        if cv2.waitKey(5) & 0xFF == 27:
            break
cap.release()

You can run the code to test whether the landmarks and hand annotations are drawn correctly on your hand. You should be able to see the output shown in the figure below.

Fig 8: Sample Output

Step 2: Create lists of coordinates from extracted landmarks

results.multi_hand_landmarks[0].landmark returns the coordinates of all 21 landmarks along three axes, with x and y normalized to the range [0, 1]. The index 0 means we only use the landmarks of the first hand detected.

In this step, we multiply these normalized values by the image width and height to obtain pixel coordinates. We then append them to a list containing the landmark index and the corresponding x, y coordinates.

lml, xl, yl = [], [], []  # landmark list and x, y coordinate lists
for id, lm in enumerate(results.multi_hand_landmarks[0].landmark):
    h, w, _ = image.shape
    xc, yc = int(lm.x * w), int(lm.y * h)  # convert normalized values to pixels
    lml.append([id, xc, yc])
    xl.append(xc)
    yl.append(yc)
The resulting list of landmark coordinates in the format [index, x, y]

Step 3: Locate the thumb and index fingertips and measure the distance between them

We will use the distance between the index fingertip and the thumb tip to control the volume.

First, we obtain the coordinates of the thumb tip and index fingertip from the landmark list lml using the landmark indices: the thumb tip is landmark 4 and the index fingertip is landmark 8 (refer to Fig 5). Then, we draw a circle on each tip and a line between them using the obtained coordinates, and calculate the distance between the two tips.

x1, y1 = lml[4][1], lml[4][2]  # thumb tip
x2, y2 = lml[8][1], lml[8][2]  # index fingertip
cx, cy = (x1 + x2) // 2, (y1 + y2) // 2  # midpoint, used to place the distance label

cv2.circle(image, (x1, y1), 10, (255, 0, 128), cv2.FILLED)
cv2.circle(image, (x2, y2), 10, (255, 0, 128), cv2.FILLED)
cv2.line(image, (x1, y1), (x2, y2), (255, 0, 128), 3)
distance = math.hypot(x2 - x1, y2 - y1)
cv2.putText(image, str(int(distance)), (cx + 30, cy), cv2.FONT_HERSHEY_COMPLEX, 1, (255, 0, 128), 3)
Fig 9: Distance (in pixels) between thumb and index fingertips

Step 4: Create an activation function to check the hand size

The volume is adjusted from the distance value, and the working range of distance is fixed. Because distance is measured in pixels between the thumb and index fingertip, its value depends on how far the hand is from the webcam. Hence we need a check that only enables gesture control when the hand is within a certain range of the webcam, so that distance stays within its working range. For example, if the hand is too far from the webcam, we could never reach 100% volume, because even with the thumb and index finger fully spread the measured distance would remain below the upper bound of the working range. After experimentation, the check works well when the hand bounding-box area value (computed below) is between 300 and 1000 for someone sitting in front of the computer.
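
The area value comes from an axis-aligned bounding box around all 21 landmarks; these lines are taken from the complete listing at the end, with // 100 keeping the number in a convenient range:

xmin, xmax = min(xl), max(xl)
ymin, ymax = min(yl), max(yl)
box = xmin, ymin, xmax, ymax
cv2.rectangle(image, (box[0] - 20, box[1] - 20), (box[2] + 20, box[3] + 20), (255, 255, 0), 2)
area = (box[2] - box[0]) * (box[3] - box[1]) // 100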

if 300 < area < 1000:
    cv2.putText(image, 'GestureControl On', (0, 30), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 255, 0), 2)
    cv2.putText(image, str(int(area)), (box[1] + 50, box[1]), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 255, 0), 2)
Fig 10: Gesture Control activated as area detected > 300
Fig 11: Gesture Control deactivated as area detected < 300

Step 5: Compute volume and draw volume information

We calculate volumeBar and volumePercent by linearly interpolating the distance value; these are used to draw the moving volume bar and to set the volume. The volume bar is drawn on the right side of the image to give feedback on the detected volume percentage stored in volumePercent. For a better user experience, we also include a set of rules that colour the volume bar according to the volume level.
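
np.interp maps a value from one range onto another and clamps values outside the input range, so with the working range used below (50-200 px mapped to 0-100%), a thumb-to-index distance of 125 px corresponds to 50%:

import numpy as np

print(np.interp(125, [50, 200], [0, 100]))  # 50.0  (midpoint of the range)
print(np.interp(30,  [50, 200], [0, 100]))  # 0.0   (clamped below the range)
print(np.interp(250, [50, 200], [0, 100]))  # 100.0 (clamped above the range)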

volumeBar = int(np.interp(distance, [50, 200], [400, 150]))
volumePercent = int(np.interp(distance, [50, 200], [0, 100]))

cv2.rectangle(image, (w - 50, 150), (w - 80, 400), (255, 255, 255), 2)
if 20 <= volumePercent < 50:
    cv2.rectangle(image, (w - 50, int(volumeBar)), (w - 80, 400), (0, 255, 0), cv2.FILLED)
    cv2.putText(image, f'{int(volumePercent)} %', (w - 100, 450), cv2.FONT_HERSHEY_COMPLEX,
                1, (0, 255, 0), 2)
elif 50 <= volumePercent < 80:
    cv2.rectangle(image, (w - 50, int(volumeBar)), (w - 80, 400), (0, 255, 255), cv2.FILLED)
    cv2.putText(image, f'{int(volumePercent)} %', (w - 100, 450), cv2.FONT_HERSHEY_COMPLEX,
                1, (0, 255, 255), 2)
elif volumePercent >= 80:
    cv2.rectangle(image, (w - 50, int(volumeBar)), (w - 80, 400), (0, 0, 255), cv2.FILLED)
    cv2.putText(image, f'{int(volumePercent)} %', (w - 100, 450), cv2.FONT_HERSHEY_COMPLEX,
                1, (0, 0, 255), 2)
else:  # volumePercent < 20
    cv2.rectangle(image, (w - 50, int(volumeBar)), (w - 80, 400), (255, 255, 0), cv2.FILLED)
    cv2.putText(image, f'{int(volumePercent)} %', (w - 100, 450), cv2.FONT_HERSHEY_COMPLEX,
                1, (255, 255, 0), 2)

cVol = int(volume.GetMasterVolumeLevelScalar() * 100)
cv2.putText(image, f'Current Volume: {int(cVol)}', (0, 60), cv2.FONT_HERSHEY_COMPLEX,
            1, (255, 255, 255), 2)

Step 6: Create Finger Check Function

Next, we create a check for whether each finger is open or closed towards the palm: a finger counts as open when its tip landmark (8, 12, 16 or 20) sits higher in the image (smaller y) than the joint two landmarks below it. This information lets us trigger additional actions such as set volume, mute and unmute depending on the finger positions.

fCount = []  # 1 = open, 0 = closed, for index, middle, ring and pinky fingers
for fid in range(8, 21, 4):  # fingertip landmarks 8, 12, 16, 20
    if lml[fid][2] < lml[fid - 2][2]:  # tip is above the PIP joint -> finger open
        fCount.append(1)
    else:
        fCount.append(0)

Step 7: Create Set Volume and Mute/Unmute Functions

In this final step, we add the last piece of logic required for this application. It performs the following actions depending on the hand gesture, which is determined from the finger positions extracted in Step 6.

  • Set the volume when only the pinky finger is closed
  • Mute when the middle and ring fingers are closed (pinky open)
  • Unmute when the middle, ring and pinky fingers are closed

The mute status and the action performed are drawn in the top-left corner of the image.

if fCount[3] == 0 and fCount[2] == 1 and fCount[1] == 1 and fCount[0] == 1:
    # Only the pinky is closed: apply the interpolated volume
    volume.SetMasterVolumeLevelScalar(volumePercent / 100, None)
    cv2.putText(image, 'Volume Set', (0, 90), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)
    colorVol = (0, 255, 0)
elif fCount[3] == 1 and fCount[2] == 0 and fCount[1] == 0 and muteStatus == False:
    # Middle and ring closed, pinky open: mute
    volume.SetMute(1, None)
    cv2.putText(image, 'Muted', (0, 90), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)
    muteStatus = True
elif fCount[3] == 0 and fCount[2] == 0 and fCount[1] == 0 and muteStatus == True:
    # Middle, ring and pinky closed: unmute
    volume.SetMute(0, None)
    cv2.putText(image, 'Unmuted', (0, 90), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)
    muteStatus = False

if muteStatus == True:
    cv2.putText(image, "Muted", (0, 120), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)

Optional Step: Frame Rate Counter

We can display an FPS counter to monitor the performance of this application using the code below (previousTime is initialised to 0 before the loop).

currentTime = time.time()
fps = 1 / (currentTime - previousTime)
previousTime = currentTime
cv2.putText(image, f'FPS: {int(fps)}', (w - 150, 50), cv2.FONT_HERSHEY_COMPLEX,
            1, (255, 255, 255), 2)

Video Demonstration of Completed Application

Source Code of Completed Application

You can find the entire source code of the completed application below for your reference.

import cv2
import mediapipe as mp
import time
import math
import numpy as np
from ctypes import cast, POINTER
from comtypes import CLSCTX_ALL
from pycaw.pycaw import AudioUtilities, IAudioEndpointVolume

# pycaw setup: get an interface to the system speaker volume
devices = AudioUtilities.GetSpeakers()
interface = devices.Activate(
    IAudioEndpointVolume._iid_, CLSCTX_ALL, None)
volume = cast(interface, POINTER(IAudioEndpointVolume))
volume.GetMute()
volume.GetMasterVolumeLevel()
volumeRange = volume.GetVolumeRange()
vol = 0
volumeBar = 400
volumePercent = 0
muteStatus = False

mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

previousTime = 0

cap = cv2.VideoCapture(0)
with mp_hands.Hands(
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        success, image = cap.read()
        if not success:
            print("Ignoring empty camera frame.")
            # If loading a video, use 'break' instead of 'continue'.
            continue

        lml = []
        xl = []
        yl = []
        box = []

        # Flip the image horizontally for a later selfie-view display, and convert
        # the BGR image to RGB.
        image = cv2.cvtColor(cv2.flip(image, 1), cv2.COLOR_BGR2RGB)
        h, w, _ = image.shape
        # To improve performance, optionally mark the image as not writeable to
        # pass by reference.
        image.flags.writeable = False
        results = hands.process(image)

        # Draw the hand annotations on the image.
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(
                    image, hand_landmarks, mp_hands.HAND_CONNECTIONS)

            # Step 2: Create lists of coordinates from extracted landmarks
            for id, lm in enumerate(results.multi_hand_landmarks[0].landmark):
                xc, yc = int(lm.x * w), int(lm.y * h)
                lml.append([id, xc, yc])
                xl.append(xc)
                yl.append(yc)

            # Step 3: Obtain coordinates of the thumb and index fingertips,
            # draw circles on them and a line between them
            x1, y1 = lml[4][1], lml[4][2]
            x2, y2 = lml[8][1], lml[8][2]
            cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
            cv2.circle(image, (x1, y1), 10, (255, 0, 128), cv2.FILLED)
            cv2.circle(image, (x2, y2), 10, (255, 0, 128), cv2.FILLED)
            cv2.line(image, (x1, y1), (x2, y2), (255, 0, 128), 3)
            # cv2.circle(image, (cx, cy), 10, (255, 0, 128), cv2.FILLED)
            distance = math.hypot(x2 - x1, y2 - y1)
            # cv2.putText(image, str(int(distance)), (cx + 30, cy), cv2.FONT_HERSHEY_COMPLEX, 1, (255, 0, 128), 3)

            # Step 4: Check the hand size before activating gesture control
            xmin, xmax = min(xl), max(xl)
            ymin, ymax = min(yl), max(yl)
            box = xmin, ymin, xmax, ymax
            cv2.rectangle(image, (box[0] - 20, box[1] - 20), (box[2] + 20, box[3] + 20), (255, 255, 0), 2)
            area = (box[2] - box[0]) * (box[3] - box[1]) // 100

            if 300 < area < 1000:
                cv2.putText(image, 'GestureControl On', (0, 30), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 255, 0), 2)
                cv2.putText(image, str(int(area)), (box[1] + 50, box[1]), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 255, 0), 2)

                # Step 5: Compute volume and draw volume information
                volumeBar = int(np.interp(distance, [50, 200], [400, 150]))
                volumePercent = int(np.interp(distance, [50, 200], [0, 100]))

                cv2.rectangle(image, (w - 50, 150), (w - 80, 400), (255, 255, 255), 2)
                if 20 <= volumePercent < 50:
                    cv2.rectangle(image, (w - 50, int(volumeBar)), (w - 80, 400), (0, 255, 0), cv2.FILLED)
                    cv2.putText(image, f'{int(volumePercent)} %', (w - 100, 450), cv2.FONT_HERSHEY_COMPLEX,
                                1, (0, 255, 0), 2)
                elif 50 <= volumePercent < 80:
                    cv2.rectangle(image, (w - 50, int(volumeBar)), (w - 80, 400), (0, 255, 255), cv2.FILLED)
                    cv2.putText(image, f'{int(volumePercent)} %', (w - 100, 450), cv2.FONT_HERSHEY_COMPLEX,
                                1, (0, 255, 255), 2)
                elif volumePercent >= 80:
                    cv2.rectangle(image, (w - 50, int(volumeBar)), (w - 80, 400), (0, 0, 255), cv2.FILLED)
                    cv2.putText(image, f'{int(volumePercent)} %', (w - 100, 450), cv2.FONT_HERSHEY_COMPLEX,
                                1, (0, 0, 255), 2)
                else:  # volumePercent < 20
                    cv2.rectangle(image, (w - 50, int(volumeBar)), (w - 80, 400), (255, 255, 0), cv2.FILLED)
                    cv2.putText(image, f'{int(volumePercent)} %', (w - 100, 450), cv2.FONT_HERSHEY_COMPLEX,
                                1, (255, 255, 0), 2)

                cVol = int(volume.GetMasterVolumeLevelScalar() * 100)
                cv2.putText(image, f'Current Volume: {int(cVol)}', (0, 60), cv2.FONT_HERSHEY_COMPLEX,
                            1, (255, 255, 255), 2)

                # Step 6: Check whether each finger is open or closed
                fCount = []
                for fid in range(8, 21, 4):
                    if lml[fid][2] < lml[fid - 2][2]:
                        fCount.append(1)
                    else:
                        fCount.append(0)

                # Step 7: Set volume and mute/unmute based on the detected gesture
                if fCount[3] == 0 and fCount[2] == 1 and fCount[1] == 1 and fCount[0] == 1:
                    volume.SetMasterVolumeLevelScalar(volumePercent / 100, None)
                    cv2.putText(image, 'Volume Set', (0, 90), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)
                    colorVol = (0, 255, 0)
                elif fCount[3] == 1 and fCount[2] == 0 and fCount[1] == 0 and muteStatus == False:
                    volume.SetMute(1, None)
                    cv2.putText(image, 'Muted', (0, 90), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)
                    muteStatus = True
                elif fCount[3] == 0 and fCount[2] == 0 and fCount[1] == 0 and muteStatus == True:
                    volume.SetMute(0, None)
                    cv2.putText(image, 'Unmuted', (0, 90), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)
                    muteStatus = False

                if muteStatus == True:
                    cv2.putText(image, "Muted", (0, 120), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)

            else:
                cv2.putText(image, 'GestureControl Off', (0, 30), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)
                cv2.putText(image, str(int(area)), (box[1] + 50, box[1]), cv2.FONT_HERSHEY_COMPLEX, 1, (0, 0, 255), 2)

        # Optional Step: FPS Counter
        currentTime = time.time()
        fps = 1 / (currentTime - previousTime)
        previousTime = currentTime
        cv2.putText(image, f'FPS: {int(fps)}', (w - 150, 50), cv2.FONT_HERSHEY_COMPLEX,
                    1, (255, 255, 255), 2)

        cv2.imshow('MediaPipe Hands', image)

        if cv2.waitKey(5) & 0xFF == 27:
            break
cap.release()

References:

  1. Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang and Matthias Grundmann, "MediaPipe Hands: On-device Real-time Hand Tracking", 2020
  2. https://google.github.io/mediapipe/solutions/hands
