Category: TalkingData
What are the difficulties in self-service associated queries?
Things are universally interconnected, and many queries with business significance also involve the association of multiple data tables. BI software usually provides self-service query function, and some software can also support associated queries, but most of them are actually single table, and the associated query function is rarely used by business personnel. Queries involving associated tables often require technical personnel to prepare in advance, which is commonly referred to as wide table. Business personnel usually only perform queries based on a single wide table. Associated queries are the weak point of almost all BI software, whether it’s a big name
How many data analysis tasks can BI software handle?
Actually, not much! From the early call for multidimensional analysis to the recent call for agile BI, BI vendors have been emphasizing self-service capabilities, claiming that business personnel can analyze data at their own discretion, and users often have strong needs. Both parties can easily form purchasing behavior as soon as they reach an agreement. However, in terms of the expected job content for most users who lack BI application experience, the goal of self-service can be said to be far from achieved! From experience, the best situation can only solve about 30% of the problems, while most BI products
What kind of OLAP do we need?
The word OLAP literally means online analysis, which means that personnel perform various interactive analysis operations on data. However, the current concept of OLAP has been severely narrowed down by BI software. When it comes to OLAP on business analysis, it often only has the function of multidimensional analysis in technology, which is to summarize a pre-built data cube according to the specified dimension level and present it as a table or graph, supplemented by operations such as drilling, aggregation, rotation, slicing, etc. to change the dimension level and summary range. These are familiar to everyone, so we won’t go
The difficulty of SQL stems from relational algebra
In the field of structured data computing, SQL is still the most widely used working language, not only adopted by all relational databases, but also targeted by many new big data platforms. For a certain computing technology, people usually care about two efficiencies. One is the descriptive efficiency of operations, and the other is the execution efficiency of operations. This is easy to understand. If the descriptive efficiency is too low, it means that the development cost is too high and it is difficult to write programs for calculation; If the execution efficiency is low, it takes a long time
First Half Ordered Sorting and Ordered Cursor
Having encountered such a case, the situation can be simplified and summarized as follows: there is a table T in the database, which has two important fields a and b, a is a timestamp, accurate to seconds; b is the user ID; Other fields are used to represent the event attributes that occurred for user b at time a. The current task is to export the data sorted by a and b. Simply put, it means writing out the result set of SELECT * From T ORDER BY a, b to a file. However, this T table has billions of
Multi-purpose traversal
Reducing external storage (hard disk) access has always been an eternal topic for improving big data computing performance. We have also discussed methods such as columnar storage and compression that directly reduce access and even storage. In addition to these storage level methods, methods can also be found to reduce external storage access in the algorithm and computational implementation stages. Traversing is an essential part of big data computing. Sometimes, we may find that in a computing task, there are two (or more) traversal actions involving the same batch of data. If we can find a way to merge two
Positioning function and positioning calculation
Filtering is a very basic operation, which is to find a subset of members from a set that meet certain conditions. There are WHERE and HAVING in SQL to implement filtering operations (strictly speaking, due to that SQL lacks discreteness, the result of WHERE operation is a replica of the subset of the original data table). SPL also has a select function to implement operations similar to WHERE in SQL (SPL that supports discreteness returns the true subset). For unordered sets, the filtering operation only needs to find members that meet the conditions, and there is no need for any
Iterative functions and custom aggregation operations
The conventional aggregation operations we have discussed, such as SUM/COUNT and unconventional aggregation operations, such as maxp/top, are pre designed aggregation functions. But what if we want to implement an operation that has not been defined before? For example, if we want to perform a multiplication operation, which is obviously an aggregation, can we only write it ourselves using a loop? Can we combine it with existing syntax and functions? (Off topic: Multiplication can be done using exp(SUM(ln(x))), but it’s a bit tricky, and it can’t handle situations where members are negative.). To design such a syntax scheme, we first
Unconventional aggregation
The standard SQL provides five most commonly used aggregation operations: SUM/COUNT/AVG/MIN/MAX. Observing these operations, we find that they can all be seen as a function that returns a single value with a set as a parameter. Let’s first understand this commonality as the definition of aggregation operation, which involves turning a set to a single value, that is to turn multiple values to one, resulting in aggregation. Therefore, it is called aggregation operation. Then obviously, aggregation operations can be applied when there is a set, so operations like SUM/COUNT can be implemented on a data table (record set). The result







