Cointime

A Feature Engineering Case Study in Consistency and Fraud Detection

Validated Venture

Main Takeaways

  • As the world’s largest crypto exchange, it’s crucial we have a risk detection system that is fast yet doesn’t compromise on accuracy. 
  • The challenge we encountered was ensuring our models always used up-to-date information, especially when detecting suspicious account activity in real-time. 
  • To achieve stronger feature consistency and greater production speed, we now make reasonable assumptions about our data and combine our batch and streaming pipelines. 

Discover how our feature engineering pipeline creates strong, consistent features to detect fraudulent withdrawals on the Binance platform. 

Inside our machine learning (ML) pipeline — which you can learn more about in a previous article — we recently built an automated feature engineering pipeline that funnels raw data into reusable online features that can be shared across all risk-related models. 

In the process of building and testing this pipeline, our data scientists encountered an intriguing feature consistency problem: How do we create accurate sets of online features that dynamically change over time?

Consider this real-world scenario: a crypto exchange (in this case, Binance) is trying to detect fraudulent withdrawals before money leaves the platform. One possible solution is to add a feature to the model that measures the time elapsed since the user's last specific operation (e.g., logging in or binding a mobile number). It would look something like this:

user_id | last_bind_google_time_diff_in_days | ...
1       | 3.52                               | ...
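To make the feature concrete, here is a minimal sketch of how such a value could be derived from two timestamps. The function name and the timestamps are hypothetical, chosen only so the result matches the 3.52 in the example row; the article does not show the actual implementation.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 86_400


def time_diff_in_days(last_operation_at: datetime, now: datetime) -> float:
    """Days elapsed between the user's last operation and `now`."""
    return (now - last_operation_at).total_seconds() / SECONDS_PER_DAY


# Hypothetical timestamps: the user bound Google authentication on Jan 1,
# and the feature is evaluated 3 days, 12 hours, 28 minutes, 48 seconds later.
last_bind = datetime(2023, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
now = datetime(2023, 1, 4, 12, 28, 48, tzinfo=timezone.utc)

feature = round(time_diff_in_days(last_bind, now), 2)
print(feature)  # 3.52

# One hour later the value has grown, even though the user did nothing.
later = now + timedelta(hours=1)
print(time_diff_in_days(last_bind, later) > feature)  # True
```

The second print is the important one: the value drifts with the clock alone, so it can go stale for every user at once, not only for users who generate new events.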

The Challenge of Implementation

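Implementing this feature is harder than it looks. The value keeps growing for every user as time passes, whether or not that user does anything, so keeping it current in an online feature store would mean recalculating and rewriting an impractical number of keys. A streaming pipeline such as Flink cannot do this on its own, since it only calculates features for users whose records are arriving in Kafka at that moment.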

As a compromise, we could use a batch pipeline and accept some delay. Suppose the model fetches features from the online feature store for real-time inference and can tolerate features that are around one hour old. If the batch job could finish calculating and ingesting the data into the feature store within that hour, the batch pipeline would, in theory, solve the problem.

Unfortunately, there’s one glaring issue: using such a batch pipeline is highly time-consuming. This makes finishing within one hour unfeasible when you’re the world’s largest crypto exchange dealing with approximately a hundred million users and a TPS limit for writes.  
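A rough back-of-the-envelope calculation shows why the one-hour budget is hard to meet. The user count and the one-hour window come from the text; the write limit below is a purely hypothetical number, since the article does not state the actual TPS limit.

```python
USERS = 100_000_000       # "approximately a hundred million users" (from the text)
WINDOW_SECONDS = 60 * 60  # the one-hour budget discussed above

# Sustained write rate needed to refresh every user once per window.
required_writes_per_second = USERS / WINDOW_SECONDS
print(round(required_writes_per_second))  # 27778


def hours_to_ingest(users: int, writes_per_second: float) -> float:
    """Hours needed to write one record per user at a fixed write rate."""
    return users / writes_per_second / 3600


# ASSUMPTION: an illustrative write limit, not a figure from the article.
ASSUMED_TPS_LIMIT = 5_000
print(f"{hours_to_ingest(USERS, ASSUMED_TPS_LIMIT):.2f} hours")  # 5.56 hours
```

Under that assumed limit, a full refresh would take several hours before any computation time is counted, which is why shrinking the set of users to ingest matters.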

We’ve found that the best practice is to make assumptions about our users, thereby shrinking the amount of data going into our feature store. 

Easing the Issue With Practical Assumptions

Online features are ingested in real-time and are constantly changing because they represent the most up-to-date version of an environment. With active Binance users, we cannot afford to use models with outdated features.

It’s imperative that our system flags any suspicious withdrawals as soon as possible. Any added delay, even by a few minutes, means more time for a malicious actor to get away with their crimes. 

So, for the sake of efficiency, we assume recent logins hold relatively higher risk:

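  • A fixed ingestion delay distorts small values far more than large ones. We find that adding a 3-hour delay (3/24 = 0.125 day) to a value of 250 days produces a much smaller relative error than adding the same delay to a value of 1 day.
  • For most operations, the elapsed time won't exceed a certain threshold; let's say 365 days. To save time and computing resources, we omit users who haven't logged in for over a year.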
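The two assumptions above can be checked with a short sketch. The helper names are hypothetical, and the relative error here is defined as the delay divided by the true elapsed time, which is one reasonable reading of the comparison in the text rather than the authors' stated metric.

```python
from typing import Dict

DELAY_DAYS = 3 / 24   # the 0.125-day delay from the text
MAX_AGE_DAYS = 365    # the example threshold from the text


def relative_error(stale_value_days: float, delay_days: float = DELAY_DAYS) -> float:
    """Fraction by which a stale feature understates the true elapsed time."""
    true_value = stale_value_days + delay_days
    return delay_days / true_value


print(f"{relative_error(250):.4%}")  # about 0.05% for a 250-day value
print(f"{relative_error(1):.4%}")    # about 11% for a 1-day value


def users_to_ingest(days_since_last_op: Dict[int, float]) -> Dict[int, float]:
    """Drop users whose last operation is older than the threshold."""
    return {
        user_id: days
        for user_id, days in days_since_last_op.items()
        if days <= MAX_AGE_DAYS
    }


# Hypothetical users: only the one inactive for over a year is omitted.
print(users_to_ingest({1: 3.52, 2: 400.0, 3: 250.0}))
```

The same delay is negligible for a user whose last operation was months ago and significant for one who acted yesterday, which is why the batch path can serve the former while recent activity needs fresher data.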

Our Solution

We use a lambda architecture, which combines batch and streaming pipelines, to achieve stronger feature consistency.

What does the solution look like conceptually?

  • Batch Pipeline: Performs feature engineering for a massive user base.
  • Streaming Pipeline: Remedies batch pipeline delay time for recent logins.

What if a record is ingested into the online feature store during the batch ingestion delay?

Our features still maintain strong consistency even when records are ingested during the one-hour batch ingestion delay period. This is because the online feature store we use at Binance returns the latest value based on the event_time you specify when retrieving the value.
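The behavior described here can be illustrated with a toy in-memory store. Everything in this sketch (`ToyFeatureStore`, `FeatureRecord`, the timestamps) is hypothetical and only mimics the "latest value by event_time" rule; it is not the API of the feature store used at Binance.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FeatureRecord:
    value: float       # e.g. days since the user's last login
    event_time: float  # when the value was true, in epoch seconds


class ToyFeatureStore:
    """In-memory stand-in that keeps, per user, the record with the newest event_time."""

    def __init__(self) -> None:
        self._latest: Dict[int, FeatureRecord] = {}

    def put(self, user_id: int, record: FeatureRecord) -> None:
        current = self._latest.get(user_id)
        # Arrival order is ignored; only event_time decides which record wins.
        if current is None or record.event_time >= current.event_time:
            self._latest[user_id] = record

    def get(self, user_id: int) -> Optional[FeatureRecord]:
        return self._latest.get(user_id)


store = ToyFeatureStore()

# Streaming pipeline: user 1 logs in at t=1000 and the record lands right away.
store.put(1, FeatureRecord(value=0.0, event_time=1_000))

# Batch pipeline: a snapshot taken earlier (t=400) finishes ingesting afterwards.
store.put(1, FeatureRecord(value=3.52, event_time=400))

print(store.get(1).value)  # 0.0, the fresher streaming value is kept
```

Because the older batch record carries an earlier event_time, it cannot overwrite the fresher streaming record even though it arrives later, so the two pipelines can write to the same key without coordinating.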
