In the last article, Sparse Matrix Multiplication with Elasticsearch and Apache Spark, we went through a method of doing large-scale, sparse matrix multiplication using Apache Spark and Elasticsearch clusters in the cloud. This article is really an extension to that tutorial, in which we will generalize the method to rectangular matrices. This is important because we are crafting a future article that will make use of this technique. We also need it to function for non-square matrices.


Introductory note: Sloan Ahrens is a co-founder of Qbox who is now a freelance data consultant. This is the fifth in a series of guest posts, in which he demonstrates how to set up a large scale machine learning infrastructure using Apache Spark and Elasticsearch. -Mark Brandon


In the matrix multiplication tutorial, all of the matrices are square. That is, the number of rows equals the number of columns. Here, we'll make a few simple changes that will generalize the code to the case of an MxN matrix A multiplied by an NxP matrix B, yielding an MxP matrix C. Since the changes are quite simple, this will be a relatively short article.
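As a quick refresher, the dimension rule can be sketched in a few lines of plain Python (a toy dense version for illustration only, not the sparse Spark implementation used in this series):

```python
def multiply(A, B):
    """Naive dense multiplication of an MxN matrix A by an NxP matrix B."""
    M, N, P = len(A), len(B), len(B[0])
    # the inner dimensions must agree: A's columns == B's rows
    assert all(len(row) == N for row in A)
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(P)]
            for i in range(M)]  # an MxP result

A = [[1, 2, 3],
     [4, 5, 6]]       # 2x3
B = [[1, 0],
     [0, 1],
     [1, 1]]          # 3x2
print(multiply(A, B)) # 2x2 result: [[4, 5], [10, 11]]
```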

We went through the steps for deploying the code to the cloud in the last article. Because generalizing the code to multiply rectangular matrices does not change deployment in any significant way, we omit any additional deployment details and illustrate with running the code locally on our Ubuntu VM (see previous segment in this series). As always, the code is available in the GitHub repository.

Generating Random Rectangular Matrices

We begin with an updated version of the code that generates the random, sparse matrices, random_rect_mm.py (the original version is here). We can generalize the function createRandomSparseMatrix by substituting parameters for the number of rows and the number of columns, as follows:

def createRandomSparseMatrix(es_client, index_name, M, N, elem_range, shards, D):
    createMatrixIndex(es_client, index_name, shards)
    bulk_data = []
    num_of_elements = int(round(D * M*N))
    for elem_num in xrange(num_of_elements):
        # generate random row and column indices, and element value
        i = randint(1, M)
        j = randint(1, N)
        cell_val = randrange(-elem_range, elem_range, 1)
        # only store non-zero values
        if cell_val == 0:
            continue
        # use the ES bulk API
        bulk_data.append({
            "index": {
                "_index": index_name,
                "_type": 'elem',
                "_id": '%s-%s' % (i, j)
            }
        })
        bulk_data.append({
            'row': i,
            'col': j,
            'val': cell_val
        })
        if len(bulk_data) > 10000:
            res = es_client.bulk(index=index_name, body=bulk_data, refresh=True)
            bulk_data = []
    if len(bulk_data) > 0:
        res = es_client.bulk(index=index_name, body=bulk_data, refresh=True)

Next, after defining values for M, N, and P, we can create our matrices as follows:

D = 1
M = 15
N = 5
P = 10
createMatrixIndex(es_client, 'matrix-c', shards)
createRandomSparseMatrix(es_client, 'matrix-a', M, N, 10, shards, D)
createRandomSparseMatrix(es_client, 'matrix-b', N, P, 10, shards, D)

We chose these small values for illustration, so that we can easily test our results against R, as we did last time.
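As an aside, if NumPy is available you can also sanity-check the product without leaving Python (this is an alternative to the R comparison, not part of the original scripts; the matrices below are hypothetical stand-ins for the values printed by the Spark job):

```python
import numpy as np

# toy stand-ins for the comma-separated rows printed by the Spark job
A = np.array([[1, -2, 0], [3, 0, 5]])    # 2x3 input matrix
B = np.array([[2, 0], [1, 4], [-1, 1]])  # 3x2 input matrix
C = np.array([[0, -8], [1, 5]])          # the job's reported 2x2 product

# the reported product should equal the dense NumPy product
assert np.array_equal(A.dot(B), C)
```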

Another change to note is the command that calls the Spark code. We previously passed N as a command-line parameter, but now we need to pass M, N, and P:

master_path = 'local[4]'
jar_path = '~/spark/jars/elasticsearch-hadoop-2.1.0.Beta2.jar'
code_path = '~/qbox-blog-code/ch_5_rect_matmult/es_spark_rect_mm.py'
system("~/spark/bin/spark-submit --master %s --jars %s %s %s %s %s" % 
    (master_path, jar_path, code_path, M, N, P))

The other code changes are minor; you can review them in the full file here.

NOTE: We removed the loop that generated successively larger matrices, since our objective in this article is to generalize the matrix multiplication technique itself.

Rectangular Matrix Multiplication with Spark

You can find the general version of the Spark code here. The first change to note is that the three matrix parameters need to be read in from the command line arguments:

M = int(sys.argv[1])
N = int(sys.argv[2])
P = int(sys.argv[3])

Determining the replication factor G and the size of the groups is a bit more complex, now that we have three different numbers describing our matrices. There are a number of ways that we could handle the issue, and we encourage you to experiment with different ones. The method we chose is merely to take the maximum of the three parameters, and then compute the replication factor and group size in the same way as before:

# D is the average density of the input matrices
D = (matA_count + matB_count) / float(M*N + N*P)
max_dim = max(M,N,P)
# G is the replication factor
G = int(round(sqrt(sqrt(D * max_dim**2 / 2))))
# GRP_SIZE is the number of rows/cols in each grouping
GRP_SIZE = int(ceil(max_dim / float(G)))
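With the toy values above (M=15, N=5, P=10), this works out as follows (a quick check that assumes D = 1 for simplicity; the actual density will be slightly lower because zero-valued elements are skipped):

```python
from math import sqrt, ceil

D = 1.0
max_dim = max(15, 5, 10)                        # 15
G = int(round(sqrt(sqrt(D * max_dim**2 / 2))))  # round(sqrt(sqrt(112.5))) -> 3
GRP_SIZE = int(ceil(max_dim / float(G)))        # ceil(15 / 3) -> 5
print(G, GRP_SIZE)                              # 3 5
```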

We need to update the code that computes statistics for the matrices to reflect the rectangular structure of the matrices. Though we don't present that detail here, you can see it in the repo. We want to be able to test the changes against R as we did in the last article, so we update the code that prints out the matrices as follows:

# this section prints out the matrices for validation
# it can only be used with small matrices, for obvious reasons
##########################
matA = matA_rdd.map(lambda i: ((i[1]['row'],i[1]['col']), i[1]['val'])).collect()
matB = matB_rdd.map(lambda i: ((i[1]['row'],i[1]['col']), i[1]['val'])).collect()
matC = matrix_C.collect()
def print_matrix(A, M, N):
    matrix = [[0 for j in range(N)] for i in range(M)]
    for result in A:
        row = result[0][0]
        col = result[0][1]
        matrix[row-1][col-1] = result[1]
    for i in range(M):
        print(','.join([str(matrix[i][j]) for j in range(N)]) + ',')
print('A:')
print_matrix(matA, M, N)
print('B:')
print_matrix(matB, N, P)
print('C:')
print_matrix(matC, M, P)
##########################

So now we can test the code to make sure it's working as advertised. We can run the code with:

cd ~/qbox-blog-code/ch_5_rect_matmult  # or equivalent
python random_rect_mm.py

This is the output:

[Screenshot: the script's output, printing matrices A, B, and C as comma-separated rows]

Copying these two matrices into RStudio, and running the matrix multiplication as before, yields the following:

[Screenshot: the same product computed in RStudio, matching matrix C]

We've come to the end of this article. To accommodate rectangular matrices, we only had to make a few simple changes to generalize the square matrix multiplication code from the previous article. This will be quite convenient in a future article, in which we plan to use this technique as a building block for machine learning algorithms.
