Technical Practice | Building a Real-Time Data Warehouse with Flink Technology Components
Sizzling News
2022.10.28

As the Internet industry enters its second half, data timeliness is becoming ever more important to refined enterprise operations. The marketplace is a battlefield: the ability to mine valuable information in real time from the massive data generated every day, and to reach customers quickly, greatly helps enterprises adjust operational decisions and improve user experience. To feed data back more efficiently, support decision-making in a timelier manner, and maximize the value of data, enterprises have begun to build real-time data warehouses that meet the need for rapid data acquisition.

A real-time data warehouse integrates four functions: real-time data collection, real-time data processing, offline data correction, and customized data display. Together they support scenarios such as real-time business analysis, real-time marketing, and real-time risk control. New scenarios give rise to new technologies: the rise of Flink, a new-generation real-time computing engine, has advanced real-time data warehouses through its high performance, data consistency guarantees, and SQL-based programming. A real-time data warehouse built on a Flink architecture provides the data foundation for various real-time application scenarios and plays a crucial role in the data middle office system.

As a leading financial technology enterprise, Sunline has accumulated extensive practice in building real-time data warehouses. This article takes a city commercial bank's real-time data warehouse project, in which Sunline participated, as an example to share experience in building a real-time data warehouse with Flink technology components.

Real-time data warehouse construction plan for a city commercial bank

When building a real-time data warehouse, once data accuracy is ensured, the highest priority is real-time performance. Components with high read and write efficiency, such as Kafka and HBase, are therefore the first choice for data exchange in a real-time data warehouse. Given the characteristics of the selected components and the real-time requirements, the layering of the real-time data warehouse architecture needs to pay attention to the following points:

1. Simplify the link: keep the number of data processing steps as small as possible to maximize real-time performance.

2. Data flows between the layers of the real-time data warehouse through Kafka message queues, so each layer needs a clear division of labor to make it easy for operators to trace and locate data.

3. Prepare offline data to verify the real-time data and guard against calculation errors or omissions.

4. Because real-time data warehouse tasks run 7x24 without interruption, keep offline data as a fallback so that the query service can still return data normally if the link is unexpectedly interrupted at any layer and the real-time data warehouse becomes unavailable.
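
The verification in point 3 can be sketched as a simple reconciliation between real-time and offline indicator values. This is only an illustration under stated assumptions: plain Python dictionaries stand in for the real-time and offline stores, and the function name reconcile and the tolerance parameter are hypothetical, not part of the project described here.

```python
from decimal import Decimal


def reconcile(realtime, offline, tolerance=Decimal("0.00")):
    """Compare real-time indicator values against offline values.

    Both arguments map a key (for example an indicator rowkey) to a Decimal.
    Returns {key: (realtime_value, offline_value)} for every key that is
    missing on one side or whose values differ by more than `tolerance`.
    """
    mismatches = {}
    for key in realtime.keys() | offline.keys():
        rt_value = realtime.get(key)
        off_value = offline.get(key)
        if rt_value is None or off_value is None:
            # An omission: the key exists in only one of the two stores.
            mismatches[key] = (rt_value, off_value)
        elif abs(rt_value - off_value) > tolerance:
            # A calculation difference larger than the allowed tolerance.
            mismatches[key] = (rt_value, off_value)
    return mismatches
```

Keys reported with None on one side correspond to omissions, and the remaining keys to calculation differences; how such mismatches are repaired is outside the scope of this sketch.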

In summary, the real-time data warehouse is divided into four layers, described as follows:

1. RTL: Technical analysis layer, which collects data from the various business sources through data collection tools and ensures that the data structure is consistent.

2. ROL: Source-aligned layer, which is divided into a real-time area and an offline area. The real-time area stores data that has undergone basic cleaning and standardization, while the offline area stores offline dimension data synchronized daily.

3. RCL: Summary sharing layer, which stores lightly summarized, shareable data. The data is classified according to defined rules so that it can be reused.

4. RDL: Data service layer, which indexes RCL-layer data, including analysis indicators, application summary indicators, and detailed data. It is divided into a real-time area and an offline area: the real-time area stores real-time indicator results, and the offline area stores daily offline indicator results used as a data safeguard and for real-time data verification.

Real-time data processing links for different scenarios

Real-time data applications include real-time indicator calculation, streaming, real-time risk control, real-time marketing, real-time customer matching, and other business scenarios. In the FlinkSQL+OLAP production link, the collection tool synchronizes real-time data to a message queue, the real-time data warehouse processes it and lands it in various types of storage, and the results are finally either received and processed by downstream business systems or pushed by the data service platform to various terminals for display. The whole link balances data timeliness and query efficiency.

In the city commercial bank's real-time data warehouse project, the main business scenarios are real-time assets and liabilities, the real-time management cockpit, and real-time regulatory data monitoring.

Scenario 1 - Real-time assets and liabilities: After an account movement in any business system, the user's real-time balance is obtained, enriched with the relevant dimension information through joins, and written to the result table. The front end then queries the real-time result table to return the data.

Two HBase result tables are built, one real-time and one offline, so that user queries can be answered in real time both for accounts with transactions and for accounts without any movement. Compared with the original query scheme of "yesterday's offline balance + today's real-time amount", this improves the timeliness of query results and avoids the data gap during the daily cutover and the batch run window.
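
One possible reading of the two-table design is sketched here: look the account up in the real-time result table first, and fall back to the offline result table when the account has had no movement. This is an assumption about how the lookup could work, not the project's actual query logic. Plain dictionaries stand in for the HBase tables, and the name query_balance is hypothetical.

```python
from decimal import Decimal


def query_balance(account_id, realtime_table, offline_table):
    """Return (balance, source) for an account, or None if it is unknown.

    realtime_table: balances written after today's account movements.
    offline_table:  balances produced by the daily offline batch.
    """
    if account_id in realtime_table:
        # The account moved today, so the real-time balance is the latest.
        return realtime_table[account_id], "realtime"
    if account_id in offline_table:
        # No movement today: the offline balance is still current.
        return offline_table[account_id], "offline"
    return None


# Example data, purely illustrative.
realtime = {"A001": Decimal("1200.50")}
offline = {"A001": Decimal("1000.00"), "A002": Decimal("300.00")}
print(query_balance("A001", realtime, offline))
print(query_balance("A002", realtime, offline))
```

Returning the source alongside the balance lets a caller see which table answered the query, which can help when tracing a discrepancy.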

Scenario 2 - Real-time management cockpit: Account-movement transactions are obtained in real time, and indicators such as capital inflow and outflow, customer assets, loan application amount, and number of people are displayed in real time at the bank-wide level.

Taking real-time capital inflow and outflow statistics as an example, the day's account-movement data is grouped and aggregated at the indicator granularity, so that each group corresponds to one rowkey at the RDL layer. Each time an account-movement record arrives, the latest fund amount is recalculated and the result row in the HBase table is updated in real time by rowkey.
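
The update-by-rowkey pattern described above can be sketched as follows. A dictionary stands in for the HBase result table, and the rowkey layout (date, organization, direction), the class FundFlowAggregator, and the helper make_rowkey are all hypothetical choices made for illustration. The article does not specify the actual rowkey design.

```python
from collections import defaultdict
from decimal import Decimal


def make_rowkey(txn_date, org_id, direction):
    # Hypothetical rowkey layout: one key per indicator granularity.
    return f"{txn_date}|{org_id}|{direction}"


class FundFlowAggregator:
    """Keeps a running total per rowkey, as a stand-in for HBase upserts."""

    def __init__(self):
        # In-memory substitute for the RDL-layer result table.
        self.table = defaultdict(lambda: Decimal("0"))

    def on_transaction(self, txn_date, org_id, direction, amount):
        """Apply one account-movement record and return the updated row."""
        rowkey = make_rowkey(txn_date, org_id, direction)
        self.table[rowkey] += amount
        return rowkey, self.table[rowkey]


agg = FundFlowAggregator()
agg.on_transaction("2022-10-28", "BR01", "IN", Decimal("100.00"))
agg.on_transaction("2022-10-28", "BR01", "OUT", Decimal("40.00"))
print(agg.on_transaction("2022-10-28", "BR01", "IN", Decimal("25.50")))
```

Because every record at the same granularity maps to the same rowkey, each new record overwrites a single result row instead of appending a new one, which keeps the front-end query a simple key lookup.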

Scenario 3 - Real-time regulatory data monitoring: The day's transaction flow is monitored against the prescribed regulatory hit logic, and data that satisfies the hit logic is sent downstream for transaction restriction.

Taking gambling and fraud as an example, the regulatory logic is as follows: in a non-counter system, fund transactions such as collections and transfers involving 5 or more different usernames occur consecutively, the interval between consecutive transactions is no more than 3 minutes, and the amount of each transaction is between RMB 0.01 and RMB 10. Using Flink's OVER window function and event time, the job counts the counterparty customers and checks the transaction amounts within a period before and after each transaction to determine whether the hit logic is satisfied, then marks the hit and sends it to the downstream system.
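
The hit logic above can be illustrated with a small offline sketch. It is not the Flink OVER window job itself: it processes an already collected list of one account's transactions in plain Python, and it assumes that a transaction outside the amount range breaks a consecutive run. The names Txn and find_hits are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal

MAX_GAP_SECONDS = 180          # at most 3 minutes between transactions
MIN_AMOUNT = Decimal("0.01")   # RMB
MAX_AMOUNT = Decimal("10")     # RMB
MIN_COUNTERPARTIES = 5         # distinct usernames required for a hit


@dataclass(frozen=True)
class Txn:
    txn_id: str
    event_time: int            # event time in seconds since the epoch
    counterparty: str
    amount: Decimal


def _flag_run(run, hits):
    # A run is a hit when it involves enough distinct counterparties.
    if len({t.counterparty for t in run}) >= MIN_COUNTERPARTIES:
        hits.update(t.txn_id for t in run)


def find_hits(txns):
    """Return the txn_ids that satisfy the hit logic for one account."""
    hits = set()
    run = []
    for txn in sorted(txns, key=lambda t: t.event_time):
        if not (MIN_AMOUNT <= txn.amount <= MAX_AMOUNT):
            # Assumption: an out-of-range amount ends the current run.
            _flag_run(run, hits)
            run = []
            continue
        if run and txn.event_time - run[-1].event_time > MAX_GAP_SECONDS:
            # The gap is too long, so close the run and start a new one.
            _flag_run(run, hits)
            run = []
        run.append(txn)
    _flag_run(run, hits)
    return hits
```

In a streaming job the same check would be evaluated incrementally per account, ordered by event time. The batch form here is only meant to make the three conditions (distinct counterparties, interval, amount range) concrete.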

By building the real-time data warehouse with Flink, we abstracted every stage of data production into a single real-time data warehouse architecture. This unifies the data sources for full-stack real-time data applications, keeps application indicators and dimensions consistent, and greatly improves the convenience and timeliness of obtaining real-time data, thereby improving the customer's overall operational efficiency.

In the current tide of digital transformation, building a real-time data warehouse is an important part of the data middle office system and is of great significance to financial institutions. Sunline has rich implementation experience in real-time data processing and has provided real-time data processing solutions for banks of different sizes, such as Bank of China, Bank of Nanjing, and Bank of Liuzhou. In the future, Sunline will continue to explore new business forms in the field of real-time data warehouses to help customers tap the value of data efficiently and grow their business.

