SPL Speeds up Cluster Analysis of Celestial Bodies by 2000 Times

Task description

A cluster analysis task from the National Astronomical Observatories: there are 11 data tables, each extracted from one photo and containing over 5 million records, where each record consists of celestial coordinates and properties. Across the 11 “photos” the coordinates are similar but not identical, specifying locations that are not far apart. The task is to use one of the “photos” as the base and find matching celestial bodies in the other photos – bodies very close to each other are clustered into one group – then take each group’s average coordinates and properties as the final values, producing a “photo” with clearer coordinates and more precise information.

Task analysis

The task itself is not complex: loop through each pair of celestial coordinates in the base photo, compute its distance to each pair of coordinates in the other photos, treat two bodies as one group – the same star – when the distance falls below a certain threshold, and take the average of all coordinates in each group as that star’s final coordinates.
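The brute-force logic can be sketched in Python (a minimal illustration only; the `(ra, dec)` tuple layout, the `THRESHOLD` value and the function names are our own assumptions, not the observatory’s actual schema):

```python
import math

THRESHOLD = 0.001  # matching radius in degrees (illustrative value)

def angular_distance(p, q):
    """Great-circle distance in degrees between two (ra, dec) points."""
    ra1, dec1, ra2, dec2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    cos_d = (math.sin(dec1) * math.sin(dec2)
             + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))

def cluster(base_photo, other_photos):
    """Brute force: every base body is compared with every other body."""
    result = []
    for body in base_photo:            # 5 million+ iterations in the real task
        group = [body]
        for photo in other_photos:     # 10 other photos
            for cand in photo:         # 5 million+ comparisons per photo
                if angular_distance(body, cand) <= THRESHOLD:
                    group.append(cand)
        n = len(group)
        result.append((sum(p[0] for p in group) / n,
                       sum(p[1] for p in group) / n))
    return result
```

The triple loop is exactly what makes the computation astronomical: the two inner loops run in full for every single base body.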

But when handling the task on a computer we find it is incredibly compute-intensive. The base photo is looped over 5 million times, and each iteration computes distances against the 50 million+ coordinate pairs in the other photos. That is 5 million+ × 50 million+ distance calculations – an astronomical number.

It’s true. In a test, the data volume of each photo was reduced to one tenth, i.e. 0.5 million coordinate pairs per photo. Python took 6.5 days to complete the 11-photo cluster analysis using the above logic. The full dataset of 5 million+ pairs is 100 times larger in computational complexity, so it would take about 650 days to finish – unbearably long.

The same 0.5 million records were also loaded into a distributed database and computed with SQL. Driven by 100 CPUs, the computation finished in 3.8 hours – many times faster than Python. But Python’s 6.5 days came from single-threaded processing; measured per core, SQL (3.8 hours × 100 cores > 6.5 days) is actually slower than Python. This is unbearably resource-intensive, and the full 5 million+ records would still take 380 hours.

Solution

Let’s look at which part of the computing process can be optimized to reduce the amount of computation.

The celestial coordinates in the base photo must all be looped through so that every body participates in the cluster analysis. But there is no need to scan every coordinate pair in each of the other photos: we only need to find the coordinates near each base body, and binary search is well suited to this, cutting the amount of computation significantly.

Here is the process. Sort the coordinates in each photo, then use binary search to find the coordinates within the threshold window, excluding most of the celestial bodies; this is the preliminary filtering. Then compute the distance between the base body and every body in the filtered set to get the eligible bodies; this is the second filtering.
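The two-stage filter can be sketched with Python’s built-in `bisect` module (a hedged illustration, assuming each photo is a list of `(dec, ra)` tuples pre-sorted by `dec`; all names here are ours, not the original program’s):

```python
import bisect

def candidates_near(photo_sorted_by_dec, dec, ra, delta):
    """Return records within the (dec, ra) window around one base body."""
    # Preliminary filtering: binary search the sorted dec column for the
    # [dec - delta, dec + delta] band -- O(log n) instead of scanning n records.
    lo = bisect.bisect_left(photo_sorted_by_dec, (dec - delta,))
    hi = bisect.bisect_right(photo_sorted_by_dec, (dec + delta, float("inf")))
    band = photo_sorted_by_dec[lo:hi]
    # Keep only records whose ra also falls inside the window; the exact
    # distance computation (the second filtering) then runs on this small set.
    return [rec for rec in band if ra - delta <= rec[1] <= ra + delta]
```

Only the handful of records this function returns need to go through the exact, trigonometry-heavy distance computation of the second filtering.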

Let’s estimate the compute intensity of the preliminary filtering plus the second filtering. Sorting all 10 photos one by one costs about “5 million × log(5 million) × 10” operations, and the preliminary binary search-based filtering costs another “5 million × log(5 million) × 10”. The cost of the second filtering is not fixed, but experience shows the preliminary filtering usually returns no more than 10,000 records, so the per-body cost of searching plus refining is about “log(5 million) + 10,000”. The total is therefore roughly “5 million × log(5 million) × 10 + 5 million × (log(5 million) + 10,000) × 10”, which is only about 0.2 percent of the computation amount of the previous algorithm.
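A quick back-of-envelope check of that 0.2 percent figure (taking the log as base 2, which matches binary search):

```python
import math

n = 5_000_000        # coordinate pairs per photo
photos = 10          # number of other photos
hits = 10_000        # typical preliminary-filter result size

brute_force = n * n * photos                          # original all-pairs scan
optimized = (n * math.log2(n) * photos                # sort every photo
             + n * (math.log2(n) + hits) * photos)    # search + second filter

ratio = optimized / brute_force
print(f"{ratio:.2%}")   # prints "0.20%"
```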

Technology selection

We also need to choose a programming tool that matches the algorithm. Python was used in the previous test. The language is powerful, with ready-made frameworks for astronomical calculations; to compute the distance, for example, we just call the corresponding library. It’s convenient.

Yet Python has serious weaknesses:

1. Its built-in binary search (the bisect module) works only on plain lists. Using it with Pandas data requires either a third-party library or data conversions between structures, and all of this brings extra overhead.

2. Python’s multithreading is not truly parallel for CPU-bound work: the Global Interpreter Lock allows only one thread to execute bytecode at a time. This is why Python is poorly suited to the task.

The RDB’s SQL also cannot handle the task efficiently. This cluster analysis is in essence a non-equi join. A database can use optimizations such as HASH JOIN to reduce the cost of an equi-join, but for non-equi joins it can only fall back on traversal. Moreover, SQL cannot express the complex logic described above: it cannot recognize the monotonicity of the distance and therefore sort the data and apply binary search on its own. It is also weak at mathematical operations (the distance calculation involves trigonometric functions). As the test above shows, SQL’s single-core performance ends up lower than Python’s.

High-level languages like Java can implement both binary search and parallel processing well, but the code is cumbersome and development efficiency is low, making programs particularly hard to maintain.

Well, is there another tool we can use to deal with the task?

esProc SPL is a great choice. It offers many high-performance algorithms (including binary search), supports multithreaded processing, produces simple and clear code, and provides a user-friendly visual debugging mechanism that helps increase development efficiency and reduce maintenance costs.

Actual effect

In handling this task, SPL is 2,000 times faster than Python. This is the joint outcome of binary search, which contributes a speedup of over 500 times, and multithreaded processing, which contributes another 4 times (the test machine has only a 4-core processor). SPL completes the cluster analysis over the 5 million+ records in just a few hours.

Not only does the SPL code perform well, it is also concise: the key code is only 23 lines.

|    | A | B | C | D | E |
|----|---|---|---|---|---|
| 1  | =RThd | /Distance threshold | | | |
| 2  | =NJob=4 | /Number of parallel threads | | | |
| 3  | =file("BasePhoto.csv").import@tc() | /Base photo | | | |
| 4  | =directory@p(OtherPhotos) | /Paths of other photos | | | |
| 5  | for A4 | =file(A4).import@tc() | /Other photos | | |
| 6  | | =B5.sort@m(OnOrbitDec) | /Sort | | |
| 7  | | =B6.min(DEC) | | | |
| 8  | | =delta_ra=F(B7,RThd) | /Compute RA threshold value according to DEC | | |
| 9  | | =FK(B5,NJob) | /Segment data index | | |
| 10 | | fork B9 | =B5(B10) | /Photo segment | |
| 11 | | | for A3 | =C11.OnOrbitDec | /DEC |
| 12 | | | | =D11-delta_rad | /Lower DEC limit |
| 13 | | | | =D11+delta_rad | /Upper DEC limit |
| 14 | | | | =C11.RA | /RA |
| 15 | | | | =D14-delta_ra | /Lower RA limit |
| 16 | | | | =D14+delta_ra | /Upper RA limit |
| 17 | | | | =C10.select@b(between@b(OnOrbitDec,D12:D13)) | /Binary search on DEC |
| 18 | | | | =D17.select(RA>=D15&&RA<=D16) | /Search for RA |
| 19 | | | | =D18.select(Dis(~,C11)<=A1) | /Second filtering |
| 20 | | | | if D19!=[] | /Accumulate result |
| 21 | | | | | =FC(C11,D19) |
| 22 | | =@\|B10 | /Concatenate results | | |
| 23 | =file(OFile).export@tc(B22) | /Export result | | | |

The fork statement in B10 is a multithreaded processing function that executes the algorithm in segments.

In B6, sort@m() is a parallel sorting method that increases efficiency when a large volume of data is involved. Ordered data is the prerequisite for binary search.

In D17, the select@b(...) function performs the binary search, which is the key to speeding up the computation.

Summary

Performance optimization relies on high-performance algorithms. Only when the compute intensity is reduced can computational efficiency be effectively increased. High-performance algorithms need to be accumulated through everyday analysis work. Commonly used SPL optimization algorithms can be found in Performance Optimization.

High-performance algorithms also require high-efficiency programming tools. As previously mentioned, Python, SQL and Java each have their own shortcomings, ranging from lack of parallelism to difficult implementation and hard maintenance. SPL offers a rich collection of basic algorithms and supports high concurrency, which enables concise code, and it provides a user-friendly visual debugging mechanism that effectively increases development efficiency and reduces maintenance costs.
