Unsupervised Machine Learning with JavaScript

In a previous article we discussed supervised machine learning (applying natural language processing to documents). This time we'll take a look at an unsupervised machine learning technique called k-means clustering.

k-means

k-means is a clustering algorithm that we can leverage to group data without labels. (It is easy to mix up with k-NN, 'k nearest neighbours', which is a supervised classification technique.) In a classic supervised machine learning example we have training data: data that we give to our system so that it can train itself and later on predict future values. Clustering needs no such labelled training data.

k-means applies clustering to the data in 4 steps. First and foremost we need to decide how many clusters we want to create; let's say 2 for the sake of argument. (We'll see later on how to pick the appropriate number of clusters.)

Let's also assume that we have a coordinate system where we can place our datapoints.

For each of the 2 clusters, let's place a random point in the coordinate system as well. Each random point is called a centroid. We then measure the distance between every datapoint and every centroid, and assign each datapoint to its closest centroid.

Now, take the centroids and move them so that they are in the middle of their clusters, that is, at the mean position of the datapoints assigned to them.

Let's repeat the previous steps: assign each datapoint to its closest centroid again, then move the centroids again. (We keep doing this until the centroids stop moving and no datapoint needs to be reassigned.)

Once there's no further change in the assignment of the data, the algorithm has successfully created the two clusters and has been able to group the data without any instructions.
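The steps above can be sketched in a few lines of plain JavaScript. This is only a minimal illustration, not the code of any particular library: the helper names (distance, nearestCentroid, mean, kMeans) are made up for this example, and the starting centroids are passed in rather than picked at random so that the result is repeatable.

```javascript
// Euclidean distance between two points of equal dimension
function distance(a, b) {
  return Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));
}

// Index of the centroid closest to the given point
function nearestCentroid(point, centroids) {
  let best = 0;
  for (let i = 1; i < centroids.length; i++) {
    if (distance(point, centroids[i]) < distance(point, centroids[best])) {
      best = i;
    }
  }
  return best;
}

// Mean position of a non-empty list of points
function mean(points) {
  return points[0].map((_, dim) =>
    points.reduce((sum, p) => sum + p[dim], 0) / points.length
  );
}

// Hypothetical helper: repeats the assign/move steps from the given starting centroids
function kMeans(points, initialCentroids, maxIterations = 100) {
  let centroids = initialCentroids.map(c => c.slice());
  let assignments = [];
  for (let iteration = 0; iteration < maxIterations; iteration++) {
    // Assign each datapoint to its closest centroid
    const next = points.map(p => nearestCentroid(p, centroids));
    // Stop once no datapoint changes cluster
    const changed = next.some((c, i) => c !== assignments[i]);
    assignments = next;
    if (!changed) break;
    // Move each centroid to the middle of its cluster (empty clusters stay put)
    centroids = centroids.map((centroid, c) => {
      const members = points.filter((_, i) => assignments[i] === c);
      return members.length > 0 ? mean(members) : centroid;
    });
  }
  return { centroids, assignments };
}

const points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]];
const result = kMeans(points, [[0, 0], [10, 10]]);
console.log(result.assignments); // cluster index for each datapoint
console.log(result.centroids);   // final centroid positions
```

With two well separated groups like these, the first three points end up in one cluster and the last three in the other. A real implementation would normally choose the starting centroids randomly, which is why different runs can give slightly different clusters.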

Unsupervised machine learning

With unsupervised machine learning we only give data to the algorithm and it should be able to group it by itself, which means that later on we can provide it with new datapoints and it should be able to tell us which group they belong to, based on its previous clustering.

We'll go through a very simple example. Imagine that we work at a bank as data scientists and have access to data on people's monthly salaries and the amount that they spend on their credit cards per month. We could take this information and sort people into categories: people who earn less but spend more, with whom we should be careful as they may have issues with repayments; and people who spend a lot and have high salaries, who could be targeted by a campaign of some sort.

Of course we'll use k-means clustering to create these labels.

How many clusters?

Deciding on the number of clusters can be achieved with what is called the 'elbow method'. It's important to make the right decision regarding the number of clusters to use for categorisation, as the k-means algorithm is somewhat naive in the sense that it will split the data into as many clusters as you ask for. But there's a way to find the sweet spot.

The idea is simple: we iteratively go through a number of clusters (k = [1, 10] for example) and for each k value, we calculate the sum of squared errors. These sums, when plotted on a line chart, will look like an arm (it is a non-linear function). In this chart what we are looking for is the place where the arm has an elbow: the point after which adding more clusters only reduces the error sum slightly.

Calculating the sum of squared errors is really simple. Take some values, say people's salaries: 1500, 1510, 1700, 1400, 1600, 1455. (This is only a small sample.) We can now calculate the mean of these values: (1500 + 1510 + 1700 + 1400 + 1600 + 1455) / 6 = 1527.5. We can then calculate how much each value deviates from the mean: 1500 - 1527.5 = -27.5, 1510 - 1527.5 = -17.5 etc. These deviations can now be squared, which also removes the sign: (-27.5)² = 756.25, (-17.5)² = 306.25 etc. Finally we add these squared errors together: 756.25 + 306.25 + etc. The final number is the SSE, or the sum of squared errors. When clustering, the mean is the centroid of a cluster, and the squared errors are added up across all clusters.
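The same arithmetic can be written as a short JavaScript function. The function name sumOfSquaredErrors is just a label chosen for this sketch; the salary values are the sample from the paragraph above.

```javascript
// Sum of squared deviations from the mean of a list of numbers
function sumOfSquaredErrors(values) {
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return values.reduce((sum, v) => sum + (v - mean) ** 2, 0);
}

const salaries = [1500, 1510, 1700, 1400, 1600, 1455];
const sse = sumOfSquaredErrors(salaries);
console.log(sse); // 57587.5
```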

[Figure: elbow method chart, the sum of squared errors plotted against the number of clusters]
The figure above shows what the elbow method looks like for the dataset that we are using. According to that chart, the sweet spot is 4, meaning that we need to create 4 clusters, or in mathematical terms: K = 4.
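To produce the numbers behind such a chart, we could compute the total SSE for each k and collect the results. The sketch below assumes that a clustering is an array of objects, each with a centroid and the points assigned to it; elbowCurve and totalSse are hypothetical helper names, and clusterFn stands in for whatever clustering routine is used. The hand-made clusterings at the bottom exist only to exercise the helper.

```javascript
// Squared Euclidean distance between two points of equal dimension
function squaredDistance(a, b) {
  return a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0);
}

// Total SSE of a clustering shaped like [{ centroid: [...], points: [[...], ...] }, ...]
function totalSse(clusters) {
  return clusters.reduce(
    (total, cluster) =>
      total + cluster.points.reduce((sum, p) => sum + squaredDistance(p, cluster.centroid), 0),
    0
  );
}

// Hypothetical helper: clusterFn(k) must return the clustering for k clusters
function elbowCurve(maxK, clusterFn) {
  const curve = [];
  for (let k = 1; k <= maxK; k++) {
    curve.push({ k, sse: totalSse(clusterFn(k)) });
  }
  return curve;
}

// Hand-made clusterings of four 1-dimensional points, for illustration only
const handMade = {
  1: [{ centroid: [5.5], points: [[1], [2], [9], [10]] }],
  2: [
    { centroid: [1.5], points: [[1], [2]] },
    { centroid: [9.5], points: [[9], [10]] }
  ]
};

const curve = elbowCurve(2, k => handMade[k]);
console.log(curve); // [ { k: 1, sse: 65 }, { k: 2, sse: 1 } ]
```

In this toy example the error drops from 65 to 1 when going from one cluster to two, which is exactly the kind of sharp bend we look for on the chart.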

The data

Let's take a look at an example document that stores the information that we'll be using to do the analysis:

(Note that the data has been automatically generated - this is your 'all people are fictitious' disclaimer.)

{
    "name": "Allison Stokes", 
    "gender": "female", 
    "email": "allisonstokes@zilladyne.com", 
    "phone": "+1 (869) 597-2480", 
    "address": "936 Berriman Street, Zarephath, Alaska, 2799", 
    "salary": 4143, 
    "creditCardSpend": 7193
}

We will be using the salary and the creditCardSpend values and feeding them to the clustering algorithm.

The application's architecture

Since these JSON documents are stored in a NoSQL database we need to also use a connector to retrieve the documents and in our case, since we are using MarkLogic, we'll be using the MarkLogic Node.js Client API to connect to the database and to query for the documents.

We'll also be using Express as a webserver and we'll create an API endpoint to return the clustered data, as well as serve an index.html file where we'll plot the clusters on a chart.

We'll use Google's scatter charts to display the datapoints and their clusters.

Use k-means from Node.js

There are multiple packages available via npm for using the k-means algorithm in Node.js. We will be using clusters - it's easy to use and the data it returns is easily understood.

The code

First and foremost we need to query the database and add the returned data to the clustering algorithm so that it can group it. Remember, we have decided that based on our dataset, we should be using 4 clusters. The /api endpoint is responsible for querying for the data and adding that to the algorithm, and it returns the cluster data as its response.

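// app.js - code snippet
const express = require('express');
const marklogic = require('marklogic');
const clusterMaker = require('clusters');

const app = express();
// Placeholder connection settings: replace them with your own MarkLogic details
const db = marklogic.createDatabaseClient({
  host: 'localhost',
  port: 8000,
  user: 'admin',
  password: 'admin'
});
const qb = marklogic.queryBuilder;

clusterMaker.k(4); // number of clusters, chosen with the elbow method
clusterMaker.iterations(1000);

app.get('/api', (req, res) => {
  db.documents.query(
    qb.where(
      qb.directory('/client/')
    ).slice(0, 500) // take 500 documents as samples
  ).result().then(documents => {
    // Reduce each document to a [salary, creditCardSpend] pair
    const points = documents.map(document => {
      return [document.content.salary, document.content.creditCardSpend];
    });
    clusterMaker.data(points);
    const clusters = clusterMaker.clusters();
    res.json(clusters);
  }).catch(error => {
    // Report query failures instead of leaving the request hanging
    res.status(500).json({ error: error.message });
  });
});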

The structure of the clusters variable is straightforward and easy to understand:

[ { centroid: [ 3283, 2767.5 ],
    points: [ [ 3107, 2563 ], [ 3154, 2453 ], [ 3043, 2179 ], [ 3828, 3875 ] ] },
  { centroid: [ 4765.5, 2444 ],
    points: [ [ 4651, 2471 ], [ 4880, 2417 ] ] },
  { centroid: [ 6579, 2079 ], points: [ [ 6579, 2079 ] ] },
  { centroid: [ 6209.666666666667, 5707.333333333333 ],
    points: [ [ 6644, 6402 ], [ 6083, 5238 ], [ 5902, 5482 ] ] } ]

We have 4 clusters, which means we have 4 centroids, and we get a list of points belonging to each cluster. This is the information that we'll now take and plot on our scatter chart:

$.get('/api', apiData => {
    google.charts.load('current', { packages: ['corechart'] });
    google.charts.setOnLoadCallback(drawChart);
    function drawChart() {
      // One colour per cluster, in the order the API returns them
      const colours = ['red', 'green', 'orange', 'blue'];
      // Header row: x value, y value and a per-point style column
      const chartData = [
        ['Salary (£ pcm)', 'Credit Card Spend (£ pcm)', { type: 'string', role: 'style' }]
      ];
      apiData.forEach((cluster, index) => {
        const colour = colours[index] || 'blue';
        // The centroid is drawn as a larger circle
        chartData.push([...cluster.centroid, `point { size: 8; shape-type: circle; fill-color: ${colour} }`]);
        // Every datapoint in the cluster shares the cluster's colour
        cluster.points.forEach(point => {
          chartData.push([...point, `point { fill-color: ${colour} }`]);
        });
      });
      const data = google.visualization.arrayToDataTable(chartData);

      const options = {
        title: 'Monthly Salary vs Monthly Credit Card spend',
        // Title and visible range of the salary axis
        hAxis: {
          title: 'Salary (£ pcm)',
          minValue: 3000,
          viewWindow: {
            min: 2800
          }
        },
        vAxis: { title: 'Credit Card Spend (£ pcm)' },
        legend: 'none',
        pointSize: 2
      };

      const chart = new google.visualization.ScatterChart(document.getElementById('chart_div'));

      chart.draw(data, options);
    }
});

And this is the finished product:

[Figure: scatter chart of monthly salary against monthly credit card spend, coloured by cluster]

We have the 4 clusters on the scatter chart, with the larger points representing the centroids.

As a next step we could act on this clustered information: people in the blue cluster could be targeted by direct marketing campaigns to encourage them to make a purchase, while people in the green cluster could be warned not to overspend.
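To act on a new customer, one simple option is to find the centroid closest to their salary and spend. The sketch below is only an illustration of that idea: closestCluster is a hypothetical helper, the centroids are copied from the small sample output shown earlier (rounded to two decimals), and the new customer's values are made up.

```javascript
// Index of the cluster whose centroid is closest to the given [salary, spend] point
function closestCluster(point, clusters) {
  let bestIndex = -1;
  let bestDistance = Infinity;
  clusters.forEach((cluster, index) => {
    const d = Math.hypot(point[0] - cluster.centroid[0], point[1] - cluster.centroid[1]);
    if (d < bestDistance) {
      bestDistance = d;
      bestIndex = index;
    }
  });
  return bestIndex;
}

// Centroids taken from the sample output above; the points are omitted for brevity
const sampleClusters = [
  { centroid: [3283, 2767.5] },
  { centroid: [4765.5, 2444] },
  { centroid: [6579, 2079] },
  { centroid: [6209.67, 5707.33] }
];

// Made-up new customer: salary 4700, credit card spend 2500
const index = closestCluster([4700, 2500], sampleClusters);
console.log(index); // 1
```

The returned index matches the position of the cluster in the API response, so it can be mapped to the same colour that the chart uses for that cluster.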