How do you know that you are really speaking to the person you believe that you are speaking to?
Deepfake and related synthetic media technologies may represent the greatest revolution in social engineering capabilities yet developed. In recent years, scammers have used synthetic audio in vishing attacks to impersonate executives and convince employees to wire funds to unauthorized accounts. In March 2021, the FBI warned the security community to expect a significant increase in synthetic-media-enabled scams over the following 18 months. The security community is at a highly dynamic moment in history in which the world is transitioning away from being able to trust what we experience with our own eyes and ears.
The woman in the picture above is not real. She was generated by an artificial intelligence algorithm. You can generate images of fake people for yourself at this site. My research in this area initially focused on developing the Synthetic Media Attack Framework to describe and catalog these attacks. I am now shifting focus toward developing human-centric countermeasures, identifying neuro-signatures of detection, and developing security policies designed to defeat these attacks. While several technology-based methods to detect synthetic media currently exist, my work focuses on human-centered countermeasures because most technology-based solutions are not available to the average user and are difficult to apply in real time.
I recently proposed the synthetic media attack framework as a tool for researchers to better describe and catalog synthetic media attacks, and as a way for security practitioners to build more effective threat models, anticipate emerging tactics, techniques, and procedures (TTPs), and identify potential countermeasures. The framework comprises five dimensions: Medium, Interactivity, Control, Familiarity, and Intended Target.
Medium describes the communications medium that is being operationalized for the attack. While deepfakes represent the form of synthetic media that generates the most media attention, synthetic media attacks might be purely text-based, audio-based, or video-based, or combine these in multi-channel attacks. For example, a sock puppet on Twitter might pass as legitimate by relying only on simple text-based Tweets and a profile image. By contrast, a phishing scammer launching a zishing (Zoom phishing) attack through a video chat platform will need a richer syn-puppet (synthetic media puppet) that combines video and audio deepfake technology.
Interactivity describes the degree to which the syn-puppet interacts with the intended target. In the example of the cheerleading mom, the deepfake video was pre-recorded and intended only to be viewed, not to interact with the audience. The Interactivity of synthetic media attacks ranges from non-interactive, as with a voice message; to high-asynchrony, as in an email exchange; to low-asynchrony, as in instant messaging; to real-time, as in a phone conversation or video chat.
Control considers whether the syn-puppet is controlled by an artificially intelligent ‘bot’ (a software agent that perceives and responds to its environment) or by human ‘puppeteers’ who act behind synthetic personas to control these digital sock puppets. Control of the syn-puppet is a critical aspect of understanding and classifying synthetic media attacks because the ability to offload control to automation allows criminals to massively scale their enterprises at little cost. I have personally reviewed the records of multiple gift card scam exchanges, and these appear to be run by bots for the initial interactions; my observation corresponds with those of other security researchers [6]. Shortly after obtaining a few interactive responses from a potential victim, a human actor takes over the conversation. By relying on a simple chat script, the scammer can focus on engaging with people predisposed to responding instead of typing large numbers of emails that will never receive responses. The criminals in these scams often impersonate a person the victim likely knows (usually a superior such as a supervisor or department head) and request that the victim purchase gift cards for some fictitious purpose. It is easy to imagine how much more effective these scams will become when the scammer can impersonate the voice of the target’s real supervisor.
Familiarity refers to the ability to realistically impersonate someone the target knows. This ability adds an unprecedented, game-changing capability to a criminal’s arsenal, and it is likely the most significant factor distinguishing synthetic media attacks from traditional social engineering. The Familiarity dimension of the synthetic media attack framework refers to the pretextual relationship of the syn-puppet to the target, which ranges from unfamiliar (perhaps not even a real person), to familiar (a coworker or celebrity), to close (a close friend or relative).
The Intended Target represents the final dimension of the synthetic media attack framework and itself encompasses two subdimensions: human versus automated targets, and narrowcast versus broadcast. The first subdimension refers to whether the synthetic media attack is intended to deceive a human or an algorithmic target; the second refers to whether the synthetic media is intended to deceive an individual or a broader target audience. While human targets are the first that come to mind when contemplating synthetic-media-enabled social engineering, synthetic media will likely also be deployed against automation, particularly to defeat biometric authentication, if criminals have not already done so.
Examples of Intended Target by Subdimensions.
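To make the taxonomy concrete, the five dimensions above can be sketched as a simple data structure for cataloging observed attacks. This is only an illustrative sketch, not part of the published framework: the enum values, the `SyntheticMediaAttack` class, and the `HYBRID` control category (for the bot-opens-then-human-takes-over pattern seen in gift card scams) are my own naming assumptions.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative encodings of the five framework dimensions.
# Names and values here are assumptions for this sketch.
class Medium(Enum):
    TEXT = "text"
    AUDIO = "audio"
    VIDEO = "video"
    MULTI_CHANNEL = "multi-channel"

class Interactivity(Enum):
    NON_INTERACTIVE = 0   # e.g., a pre-recorded voice message
    HIGH_ASYNCHRONY = 1   # e.g., an email exchange
    LOW_ASYNCHRONY = 2    # e.g., instant messaging
    REAL_TIME = 3         # e.g., a phone conversation or video chat

class Control(Enum):
    BOT = "bot"
    HUMAN = "human"
    HYBRID = "hybrid"     # bot handles initial contact, human takes over

class Familiarity(Enum):
    UNFAMILIAR = 0        # perhaps not even a real person
    FAMILIAR = 1          # a coworker or celebrity
    CLOSE = 2             # a close friend or relative

@dataclass
class IntendedTarget:
    human: bool           # True = human target, False = automation (e.g., biometrics)
    broadcast: bool       # True = broad audience, False = a single individual

@dataclass
class SyntheticMediaAttack:
    medium: Medium
    interactivity: Interactivity
    control: Control
    familiarity: Familiarity
    target: IntendedTarget

# The gift card scam described above: text-based, bot-then-human control,
# impersonating a familiar superior, aimed at one individual.
gift_card_scam = SyntheticMediaAttack(
    medium=Medium.TEXT,
    interactivity=Interactivity.LOW_ASYNCHRONY,
    control=Control.HYBRID,
    familiarity=Familiarity.FAMILIAR,
    target=IntendedTarget(human=True, broadcast=False),
)
```

A catalog of attacks encoded this way would let a threat modeler filter, for example, for all real-time, close-familiarity attacks, the quadrant where deepfake voice impersonation of a real supervisor would land.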
You may read more about these attacks here: https://www.belay7.com/papers/BL7DeepfakeFramework.pdf