{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hierarchical Clustering with MPDist\n", "\n", "In this tutorial you will see how to use the novel MPDist metric to cluster time series data. The time series data used in this example is accelerometer data consisting of individuals performing the following actions:\n", "\n", "1. Working at Computer\n", "2. Standing Up, Walking and Going updown stairs\n", "3. Standing\n", "4. Walking\n", "5. Going UpDown Stairs\n", "6. Walking and Talking with Someone\n", "7. Talking while Standing\n", "\n", "You can read more about the data set here:\n", "\n", "http://archive.ics.uci.edu/ml/datasets/Activity+Recognition+from+Single+Chest-Mounted+Accelerometer\n", "\n", "In essence, MPDist considers \"...two time series to be similar if they share many similar subsequences, regardless of the order of matching\n", "subsequences.\" [1] I encourage those interested in the details of the metric to read the paper.\n", "\n", "**Note that all code written in this tutorial has only been tested with Python 3!**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matrixprofile as mp\n", "from scipy.cluster.hierarchy import dendrogram\n", "\n", "from matplotlib import pyplot as plt\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download and Extract Data\n", "\n", "In this section we simply obtain the dataset and extract it." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "import zipfile\n", "import tempfile\n", "import os" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "temp_dir = tempfile.mkdtemp()\n", "url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00287/Activity%20Recognition%20from%20Single%20Chest-Mounted%20Accelerometer.zip'\n", "tmp_file = os.path.join(temp_dir, 'activity_recognition.zip')\n", "extracted_dir = os.path.join(temp_dir, 'extracted')\n", "\n", "os.makedirs(extracted_dir, exist_ok=True)\n", "\n", "with urllib.request.urlopen(url) as response, open(tmp_file, 'wb') as out_file:\n", " data = response.read() # a `bytes` object\n", " out_file.write(data)\n", " \n", "with zipfile.ZipFile(tmp_file, 'r') as zip_ref:\n", " zip_ref.extractall(extracted_dir)\n", " \n", "data_dir = os.path.join(extracted_dir, 'Activity Recognition from Single Chest-Mounted Accelerometer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Raw Data\n", "\n", "In this section we load the raw data and process it. For readability purposes, the target labels are transformed into human readable descriptions." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "label_description = '''\n", "1: Working at Computer\n", "2: Standing Up, Walking and Going updown stairs\n", "3: Standing\n", "4: Walking\n", "5: Going UpDown Stairs\n", "6: Walking and Talking with Someone\n", "7: Talking while Standing\n", "'''\n", "\n", "labels = []\n", "\n", "for line in label_description.split('\\n'):\n", " line = line.strip()\n", " if line == '':\n", " continue\n", " \n", " tmp = line.split(':')\n", " num = tmp[0].strip()\n", " description = tmp[1].strip()\n", " \n", " labels.append({\n", " 'label': str(num),\n", " 'description': description\n", " })\n", "\n", "labels_df = pd.DataFrame(labels)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>label</th><th>description</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>Working at Computer</td></tr>\n",
"<tr><th>1</th><td>2</td><td>Standing Up, Walking and Going updown stairs</td></tr>\n",
"<tr><th>2</th><td>3</td><td>Standing</td></tr>\n",
"<tr><th>3</th><td>4</td><td>Walking</td></tr>\n",
"<tr><th>4</th><td>5</td><td>Going UpDown Stairs</td></tr>\n",
"<tr><th>5</th><td>6</td><td>Walking and Talking with Someone</td></tr>\n",
"<tr><th>6</th><td>7</td><td>Talking while Standing</td></tr>\n",
"</tbody>\n",
"</table>\n", "
<table>\n",
"<thead><tr><th></th><th>sequence</th><th>x</th><th>y</th><th>z</th><th>label</th><th>participant_id</th><th>description</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0.0</td><td>1983</td><td>2438</td><td>1825</td><td>1</td><td>11</td><td>Working at Computer</td></tr>\n",
"<tr><th>1</th><td>1.0</td><td>1948</td><td>2442</td><td>1797</td><td>1</td><td>11</td><td>Working at Computer</td></tr>\n",
"<tr><th>2</th><td>2.0</td><td>1927</td><td>2388</td><td>1784</td><td>1</td><td>11</td><td>Working at Computer</td></tr>\n",
"<tr><th>3</th><td>3.0</td><td>1960</td><td>2319</td><td>1831</td><td>1</td><td>11</td><td>Working at Computer</td></tr>\n",
"<tr><th>4</th><td>4.0</td><td>1967</td><td>2274</td><td>1871</td><td>1</td><td>11</td><td>Working at Computer</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>122195</th><td>122200.0</td><td>2063</td><td>2360</td><td>2000</td><td>7</td><td>4</td><td>Talking while Standing</td></tr>\n",
"<tr><th>122196</th><td>122200.0</td><td>2056</td><td>2368</td><td>2001</td><td>7</td><td>4</td><td>Talking while Standing</td></tr>\n",
"<tr><th>122197</th><td>122200.0</td><td>2059</td><td>2366</td><td>2001</td><td>7</td><td>4</td><td>Talking while Standing</td></tr>\n",
"<tr><th>122198</th><td>122200.0</td><td>2063</td><td>2382</td><td>2004</td><td>7</td><td>4</td><td>Talking while Standing</td></tr>\n",
"<tr><th>122199</th><td>122200.0</td><td>2071</td><td>2381</td><td>2004</td><td>7</td><td>4</td><td>Talking while Standing</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>1923177 rows × 7 columns</p>\n", "