Cointime

A Feature Engineering Case Study in Consistency and Fraud Detection

Validated Venture

Main Takeaways

  • As the world’s largest crypto exchange, we need a risk detection system that is fast without compromising on accuracy.
  • The challenge we encountered was ensuring our models always used up-to-date information, especially when detecting suspicious account activity in real time.
  • To achieve stronger feature consistency and greater production speed, we now make reasonable assumptions about our data and combine our batch and streaming pipelines.

Discover how our feature engineering pipeline creates strong, consistent features to detect fraudulent withdrawals on the Binance platform. 

Inside our machine learning (ML) pipeline — which you can learn more about in a previous article — we recently built an automated feature engineering pipeline that funnels raw data into reusable online features that can be shared across all risk-related models. 

In the process of building and testing this pipeline, our data scientists encountered an intriguing feature consistency problem: How do we create accurate sets of online features that dynamically change over time?

Consider this real-world scenario: A crypto exchange (in this case, Binance) is trying to detect fraudulent withdrawals before money leaves the platform. One possible solution is to add a feature to the model that measures the time elapsed since the user’s last specific operation (e.g., logging in or binding a mobile device). It would look something like this:

user_id|last_bind_google_time_diff_in_days|...
1|3.52|...
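As a minimal sketch of how such a value could be derived, the snippet below computes the days elapsed since a user's last operation. The function name, the timestamps, and the rounding to two decimals are illustrative assumptions, not the production implementation.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def time_diff_in_days(last_event: datetime, now: datetime) -> float:
    """Return the days elapsed between the user's last operation and `now`."""
    return (now - last_event).total_seconds() / SECONDS_PER_DAY


# Hypothetical timestamps chosen so that the result matches the sample row above.
last_bind = datetime(2023, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
now = datetime(2023, 1, 4, 12, 28, 48, tzinfo=timezone.utc)

feature = round(time_diff_in_days(last_bind, now), 2)
print(f"1|{feature}|...")
```

Note that the value depends on `now`, so it changes for every user as time passes, even when the user does nothing.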

The Challenge of Implementation

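Because the time difference keeps growing for every user, every key has to be recalculated and rewritten each time the feature is refreshed, so the number of keys to update in an online feature store is impractically large. A streaming pipeline, such as Flink, cannot do this on its own, since it only calculates values for users whose records are arriving in Kafka at that moment.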

As a compromise, we could use a batch pipeline and accept some delay. Let’s say a model that fetches features from an online feature store for real-time inference can tolerate values that are around one hour old. If the feature store can finish calculating and ingesting the data within that hour, the batch pipeline would, in theory, solve the problem.

Unfortunately, there’s one glaring issue: such a batch pipeline is highly time-consuming. Finishing within one hour is infeasible when you’re the world’s largest crypto exchange dealing with approximately a hundred million users and a TPS limit for writes.
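A back-of-the-envelope calculation illustrates the point. The write limit below is a purely hypothetical figure chosen for illustration; the article does not state the actual TPS limit.

```python
def hours_to_write(num_keys: int, writes_per_second: int) -> float:
    """Hours needed to write `num_keys` rows at a fixed write rate."""
    return num_keys / writes_per_second / 3600


USERS = 100_000_000         # "approximately a hundred million users"
ASSUMED_WRITE_TPS = 10_000  # hypothetical limit, not a figure from the article

hours = hours_to_write(USERS, ASSUMED_WRITE_TPS)
print(f"Full refresh takes about {hours:.2f} hours")
```

Under this assumed limit, writing alone takes well over one hour, before any time spent on calculation.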

We’ve found that the best practice is to make assumptions about our users, thereby shrinking the amount of data going into our feature store. 

Easing the Issue With Practical Assumptions

Online features are ingested in real time and are constantly changing because they represent the most up-to-date version of an environment. With active Binance users, we cannot afford to use models with outdated features.

It’s imperative that our system flags any suspicious withdrawals as soon as possible. Any added delay, even by a few minutes, means more time for a malicious actor to get away with their crimes. 

So, for the sake of efficiency, we assume recent logins hold relatively higher risk:

  • We find that (250 days + 0.125 day of delay, i.e., 3/24) carries a much smaller relative error than (1 day + 0.125 day of delay): the delay shifts the 250-day value by about 0.05% but the 1-day value by 12.5%.
  • Most operations won’t exceed a certain threshold; let’s say 365 days. To save time and computing resources, we omit users who haven’t logged in for over a year.
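The two assumptions above can be sketched in a few lines. The helper names are hypothetical, and the error is measured relative to the stored value, which is one reasonable reading of the comparison in the first bullet.

```python
BATCH_DELAY_DAYS = 3 / 24  # 0.125 day, the delay used in the bullet above
MAX_AGE_DAYS = 365         # threshold suggested in the text


def relative_error(stored_days: float, delay_days: float = BATCH_DELAY_DAYS) -> float:
    """Fraction by which a stored value is off after `delay_days` have passed."""
    return delay_days / stored_days


def keep_for_batch(days_since_last_op: float) -> bool:
    """Keep only users whose last operation falls within the threshold."""
    return days_since_last_op <= MAX_AGE_DAYS


print(f"250-day value: {relative_error(250):.2%}")
print(f"1-day value:   {relative_error(1):.2%}")
print(keep_for_batch(400))
```

The same fixed delay is negligible for old operations and significant for recent ones, which is why the recent ones need a faster path.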

Our Solution

We use a lambda architecture, which combines batch and streaming pipelines, to achieve stronger feature consistency.

What does the solution look like conceptually?

  • Batch Pipeline: Performs feature engineering for a massive user base.
  • Streaming Pipeline: Remedies batch pipeline delay time for recent logins.
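The division of labor between the two pipelines might be sketched as follows. This is a simplified illustration under stated assumptions: the record layout, the function names, and the choice to emit a value of 0.0 for a brand-new operation are all hypothetical, and plain Python functions stand in for the actual batch and Flink jobs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_AGE_DAYS = 365  # users beyond this threshold are omitted from the batch


@dataclass(frozen=True)
class FeatureRecord:
    user_id: int
    diff_in_days: float
    event_time: datetime  # the time the value refers to


def batch_records(last_ops: dict[int, datetime], snapshot: datetime) -> list[FeatureRecord]:
    """Batch path: recompute the feature for every user within the threshold."""
    records = []
    for user_id, last_op in last_ops.items():
        diff = (snapshot - last_op) / timedelta(days=1)
        if diff <= MAX_AGE_DAYS:
            records.append(FeatureRecord(user_id, diff, snapshot))
    return records


def streaming_record(user_id: int, op_time: datetime) -> FeatureRecord:
    """Streaming path: a new operation resets the time difference to zero."""
    return FeatureRecord(user_id, 0.0, op_time)


snapshot = datetime(2023, 6, 1, tzinfo=timezone.utc)
last_ops = {
    1: snapshot - timedelta(days=250),
    2: snapshot - timedelta(days=400),  # older than the threshold, so omitted
}
batch = batch_records(last_ops, snapshot)
fresh = streaming_record(2, snapshot + timedelta(minutes=20))
```

Here user 2 is skipped by the batch job but is picked up by the streaming path as soon as a new operation arrives.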

What if a record is ingested into the online feature store during the batch ingestion delay?

Our features still maintain strong consistency even when records are ingested during the one-hour batch ingestion delay period. This is because the online feature store we use at Binance returns the latest value based on the event_time you specify when retrieving the value.
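A toy in-memory store can illustrate this behavior. It is only a sketch of the semantics described above, assuming that the store resolves the value by the most recent event_time rather than by arrival order; the class and method names are hypothetical and do not correspond to any real feature store API.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class ToyFeatureStore:
    """Keeps every record and serves the one with the latest event_time."""

    def __init__(self):
        self._rows = defaultdict(list)  # user_id -> [(event_time, value), ...]

    def put(self, user_id, value, event_time):
        self._rows[user_id].append((event_time, value))

    def get(self, user_id):
        rows = self._rows.get(user_id)
        if not rows:
            return None
        # Pick by event_time, not by the order in which records were written.
        return max(rows, key=lambda row: row[0])[1]


store = ToyFeatureStore()
snapshot = datetime(2023, 6, 1, tzinfo=timezone.utc)

# The streaming record arrives first, 20 minutes after the batch snapshot.
store.put(1, 0.0, snapshot + timedelta(minutes=20))
# The batch record is written later but carries the older snapshot time.
store.put(1, 250.0, snapshot)

latest = store.get(1)
print(latest)
```

Although the batch record is written last, the streaming record has the later event_time, so the stale batch value does not override it.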
