
Chapter 4

Anti-Virus

Anti-Virus
Three tasks for anti-virus
1. Detection
o Infected or not? Provably undecidable…
2. Identification
o May be separate from detection, depending on detection method used
3. Disinfection
o Remove the virus

Detection: Static Methods
Generic methods
o Detects known and unknown viruses
o For example, anomaly detection
Virus-specific methods
o Detects known viruses
o For example, signature detection
Static --- virus code not running
Dynamic --- virus code running

Detection Outcomes

Detection Outcomes
Also can have ghost positive
Virus remnant “detected”
o But virus is no longer there
How can this happen?
o Previous disinfection was incomplete

Static Detection
Detection without running virus code
Three approaches…
1. Scanners
o Signature
2. Heuristics
o Look for “virus-like” code
3. Integrity Checkers
o Hash/checksum

Scanners
On-demand
o Files scanned when you say so
On-access
o Constant scanning in background
o Whenever file is accessed, it’s scanned

Scanners
Signature scanning
o Viruses represented by “signature”
o Signature == pattern of bits in a virus (might include wildcards)
“Hundreds of thousands of signatures”
Not feasible to scan one-by-one
o Multiple pattern search
o Efficiency is critical
We look in detail at several algorithms

Algorithm: Aho-Corasick
Developed 1975, bibliographic search
Based on finite automaton (graph)
o Circles are search states
o Edges are transitions
o Double circles are final states/output
And a failure function
o What to do when no suitable transition
o I.e., where to resume “matching”

Algorithm: Aho-Corasick
When virus scanning, search for virus signature, which is bit string
For simplicity, illustrate algorithm using English words
For our example…
Scan for any of the following words:
o hi, hips, hip, hit, chip

Algorithm: Aho-Corasick

Aho-Corasick Example

Algorithm: Aho-Corasick
How to construct automaton?
o And failure function
Build the automaton --- next slide
o A “trie”, also known as a “prefix tree”
Then determine failure function
o Two slides ahead

Aho-Corasick: Trie

Labels added in breadth-first order

Closest to root get smallest numbers

Aho-Corasick: Failure Function
Depth 1 nodes
o Fail goes back to start state
For other states
o Go back to earliest place where search can resume
o Pseudo-code is in the book

Aho-Corasick
The bottom line…
Linear search that can find multiple signatures
o Like searching in parallel for related signatures
Efficient representation of automaton is the challenge
o Both time and space issues
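
The trie and failure function described above can be sketched in Python for the example words. This is a minimal illustration, not the book's pseudo-code: the names build_automaton and search are made up here, states are numbered in insertion order rather than breadth-first, and the automaton is stored in plain dicts with no attempt at the compact representation the slides call the real challenge.

```python
from collections import deque

def build_automaton(words):
    """Build the trie (goto), the failure links, and the output sets."""
    goto = [{}]        # goto[state][char] -> next state; state 0 is the start
    output = [set()]   # words recognised on reaching each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail back to the start
    while queue:                      # breadth-first: shallow states first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]   # inherit matches ending here
    return goto, fail, output

def search(text, automaton):
    """Return (start index, word) for every occurrence, in one pass over text."""
    goto, fail, output = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]       # no transition: resume where fail says
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i - len(word) + 1, word))
    return hits

automaton = build_automaton(["hi", "hips", "hip", "hit", "chip"])
print(sorted(search("chips", automaton)))
```

Each input character is consumed once, so the scan stays linear in the input length no matter how many words are loaded.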

Algorithm: Veldman
Linear search on “reduced” signatures
o Sequential search on reduced set
From each signature, select 4 adjacent non-wildcard bytes
o Want as many signatures as possible to have each selected 4-byte pattern
Then use 2 hash tables to filter…
o Hash tables: 1st 2 bytes & 2nd 2 bytes

Algorithm: Veldman
Example
Suppose the following 5 signatures
o blar?g, foo, greep, green, agreed
Select 4-byte patterns, no wildcards:

Algorithm: Veldman
Hashes act as filters
Test things that pass thru both filters
o In this example, get things like “grar”
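
A rough Python sketch of the two-filter idea follows. The slide's table of selected patterns did not survive extraction, so the choice below (blar for blar?g, gree for greep, green and agreed) is an assumption, as are the names PATTERNS, matches and veldman_scan. The signature foo has fewer than four bytes and is simply left out of this sketch. Python sets stand in for the two hash tables.

```python
def matches(sig, data, start):
    """True if sig occurs in data at start; '?' in sig matches any one byte."""
    if start < 0 or start + len(sig) > len(data):
        return False
    return all(s == "?" or s == d for s, d in zip(sig, data[start:]))

# Assumed selection of 4-byte patterns: pattern -> [(signature, offset of
# the pattern inside that signature)]. Not taken from the original slide.
PATTERNS = {
    "blar": [("blar?g", 0)],
    "gree": [("greep", 0), ("green", 0), ("agreed", 1)],
}
FIRST = {p[:2] for p in PATTERNS}    # filter on the first two bytes
SECOND = {p[2:] for p in PATTERNS}   # filter on the second two bytes

def veldman_scan(data):
    """Return (hits, candidates): verified matches and windows that passed both filters."""
    hits, candidates = [], []
    for i in range(len(data) - 3):
        window = data[i:i + 4]
        if window[:2] in FIRST and window[2:] in SECOND:
            candidates.append(window)           # needs a closer look
            for sig, offset in PATTERNS.get(window, []):
                if matches(sig, data, i - offset):
                    hits.append((i - offset, sig))
    return hits, candidates

hits, candidates = veldman_scan("xxagreedxxgrarxxblarzg")
print(hits)
print(candidates)
```

Because the two filters are checked independently, a window such as "grar" passes both without being a selected pattern, which is why everything that gets through still has to be tested against the full signatures.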

Algorithm: Veldman
Veldman allows for wildcards and complex signatures
o Aho-Corasick does not
But both algorithms analyze every byte of input
Is it possible to do better?
o That is, can we skip some of the input?

Algorithm: Wu-Manber
Like Veldman’s algorithm
o But can skip over bytes that can’t possibly match
o Faster, improved performance
Illustrate algorithm with same signatures used for Veldman’s:
o blar?g, foo, greep, green, agreed

Algorithm: Wu-Manber
Calculate MINLEN
o Min length of any pattern substring
Two hash tables
o SHIFT --- number of bytes that can safely be skipped
o HASH --- mapping to signatures
Input bytes denoted b_1, b_2, …, b_n
Start at b_MINLEN and consider byte pairs


Example: Suppose hash tables are…

Wu-Manber Example
Here, MINLEN = 3
Start at b_MINLEN

Algorithm: Wu-Manber
How to construct hash tables?
It’s a 4-step process
o Calculate MINLEN
o Initialize SHIFT table
o Fill SHIFT table
o Fill HASH table

Algorithm: Wu-Manber
Calculate MINLEN
o Minimum number of adjacent, non-wildcard bytes in any signature
For this example, we have
o blar?g: 4, foo: 3
o greep: 5, green: 5
o agreed: 6
So we have MINLEN = 3


SHIFT table
Extract MINLEN pattern substrings
o blar?g: bla, foo: foo
o greep: gre, green: gre
o agreed: agr
Extract all distinct 2-byte sequences
o bl, la, fo, oo, gr, re, ag
If input pair is not one of these, safe to skip MINLEN - 1 bytes

Algorithm: Wu-Manber
SHIFT table
Initialize SHIFT table to MINLEN - 1
For 2-byte pairs: bl, la, fo, oo, gr, re, ag
o Denote as xy
o Let q_xy be rightmost ending position of xy in any pattern substring
o For example, gr in agr and gre, but bl in bla
o So, q_gr = 3 while q_bl = 2
o Then set SHIFT[xy] = MINLEN - q_xy
Note: Wildcard matches everything…


HASH table
If SHIFT[xy] = MINLEN - q_xy = 0
o Then we are at right edge of a pattern
So, set HASH[xy] to all signatures with pattern substring ending xy
For example
o HASH[gr]: agreed
o HASH[re]: greep, green
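
The four construction steps and the scan can be tried out with a short Python sketch. It covers only the simplest form of the algorithm and makes one assumption beyond the slides: the pattern substring is taken to be the first MINLEN bytes of each signature, and those bytes are assumed to contain no wildcard (true for this example). The function names are illustrative, and Python dicts stand in for the SHIFT and HASH tables.

```python
SIGNATURES = ["blar?g", "foo", "greep", "green", "agreed"]

def wildcard_match(sig, data, start):
    """True if sig occurs in data at start; '?' matches any single byte."""
    chunk = data[start:start + len(sig)]
    return len(chunk) == len(sig) and all(s in ("?", d) for s, d in zip(sig, chunk))

def build_tables(signatures):
    # Step 1: MINLEN = shortest run of adjacent non-wildcard bytes.
    minlen = min(max(len(run) for run in s.split("?")) for s in signatures)
    shift, hash_table = {}, {}
    for sig in signatures:
        sub = sig[:minlen]   # pattern substring (assumed wildcard-free prefix)
        # Steps 2 and 3: unknown pairs default to MINLEN - 1; for a known
        # pair keep MINLEN - q, where q is its rightmost ending position.
        for q in range(2, minlen + 1):
            pair = sub[q - 2:q]
            shift[pair] = min(shift.get(pair, minlen - 1), minlen - q)
        # Step 4: the pair ending the substring maps to the signature.
        hash_table.setdefault(sub[-2:], []).append(sig)
    return minlen, shift, hash_table

def wu_manber_scan(data, signatures=SIGNATURES):
    minlen, shift, hash_table = build_tables(signatures)
    hits, pos = [], minlen - 1           # 0-based index of b_MINLEN
    while pos < len(data):
        pair = data[pos - 1:pos + 1]
        skip = shift.get(pair, minlen - 1)
        if skip == 0:                    # right edge of a pattern substring
            start = pos - minlen + 1
            for sig in hash_table[pair]:
                if wildcard_match(sig, data, start):
                    hits.append((start, sig))
            skip = 1
        pos += skip
    return hits

print(build_tables(SIGNATURES))
print(wu_manber_scan("xxagreedyyfoozzblarXg"))
```

The tables this builds agree with the values on the slides (for example SHIFT[gr] = 0, SHIFT[bl] = 1, HASH[re] = greep, green), and the scan skips two bytes at a time whenever the current pair appears in no pattern substring.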

Algorithm: Wu-Manber
Here, we illustrated simplest form of the algorithm
More advanced forms can handle 10s of thousands of signatures
Worst case performance is terrible
o Sequential search thru every byte of input for every signature…
But tests show it’s good in practice

Testing
How can we know if scanner works?
Test on live viruses?
o Might not be a good idea
EICAR standard antivirus test file
o Not too useful either
So, what to do?
o Author doesn’t have any suggestions!

Improving Performance
“Grunt scanning” --- scan everything
o Slow slow slow
Search only beginning and end of files
Scan code entry point
o And points reachable from entry point
If position of virus in file is known…
o Make it part of the “signature”
Limit scans to size of virus(es)

Improving Performance
Only scan certain types of files
o Not so viable today
Only rescan files that have changed
o How to detect change?
o Where to store this info? Cache? Database? Tagged to file?
o Updates to signatures? Must rescan…
o How to checksum efficiently?
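
One possible answer to these questions, as a Python sketch: keep an in-memory cache keyed by file name that stores the file's digest, the signature database version, and the last verdict, and rescan only when the digest or the version changes. The ScanCache class, the scan callback and the db_version counter are all hypothetical names. The sketch hashes the whole file and does nothing to protect the cache against tampering.

```python
import hashlib

class ScanCache:
    """Remember verdicts so unchanged files are not rescanned (illustrative only)."""

    def __init__(self, scan, db_version):
        self.scan = scan              # hypothetical callback: bytes -> bool (infected?)
        self.db_version = db_version  # bump this when signatures are updated
        self.seen = {}                # name -> (digest, db_version, verdict)

    def check(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        cached = self.seen.get(name)
        if cached and cached[:2] == (digest, self.db_version):
            return cached[2]          # same content, same signatures: reuse verdict
        verdict = self.scan(data)     # file changed or signatures updated: rescan
        self.seen[name] = (digest, self.db_version, verdict)
        return verdict

scans = []
def toy_scan(data):
    scans.append(1)                   # count how often a real scan happens
    return b"virus" in data

cache = ScanCache(toy_scan, db_version=1)
cache.check("a.exe", b"clean")
cache.check("a.exe", b"clean")        # served from the cache
cache.db_version = 2                  # new signatures arrive
cache.check("a.exe", b"clean")        # must rescan
print(len(scans))
```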

Improving Performance
How to checksum efficiently?
o Checksum entire file might take longer than scanning it
o Only checksum parts that are scanned
How to avoid checksum tampering?
o Encrypt? Where to store the key?
o Checksum the checksums?
o Other?

Improving Performance
Improve the algorithm
o Maybe tailor algorithms to file type
Optimize implementation
o May be of limited value
Other?

Static Heuristics
Like having expert look at code…
Look for “virus-like” code
o Static, so we don’t execute the code
2 step process
o Gather data
o Analyze data

Static Heuristics
What data to gather?
“Short signatures” or boosters
o Junk code
o Decryption loop
o Self-modifying code
o Undocumented API calls
o Unusual/non-compiler instructions
o Strings containing obscenities or “virus”
Stopper --- thing virus would not do

Static Heuristics
Other heuristics include…
Length of code
o Too short? May be appended virus
Statistical analysis of instructions
o Handwritten assembly
o Encrypted code
Might look for signature heuristics
o Common characteristics of signatures

Static Heuristics
Analysis phase
May be simple…
o Weighted sum of various factors
o Unusual opcodes, etc.
…or complex
o Machine learning (HMM, neural nets, etc.)
o Data mining
o Heuristic search (genetic algorithm, etc.)
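
The “simple” analysis can be as small as the Python sketch below: each booster found in the data-gathering step adds weight and each stopper subtracts it. The feature names, weights and threshold are invented for illustration and are not taken from any real product.

```python
# Hypothetical weights: boosters are positive, stoppers are negative.
WEIGHTS = {
    "decryption_loop": 3.0,
    "self_modifying_code": 3.0,
    "undocumented_api_call": 2.0,
    "junk_code": 1.5,
    "unusual_instruction": 1.0,
    "writes_output_early": -4.0,   # stopper: thing a virus would not do
}
THRESHOLD = 5.0                    # assumed cut-off for "virus-like"

def heuristic_score(features):
    """features maps a feature name to how many times it was observed."""
    return sum(WEIGHTS.get(name, 0.0) * count for name, count in features.items())

def looks_viral(features):
    return heuristic_score(features) >= THRESHOLD

sample = {"decryption_loop": 1, "self_modifying_code": 1}
print(heuristic_score(sample), looks_viral(sample))
```

Choosing the weights and the threshold is where the false positive trade-off lives: a lower threshold catches more unknown viruses but flags more clean files.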

Integrity Checkers
Look for unauthorized change to files
Start with 100% clean files
Compute checksums/hashes
Store checksums
Recompute checksums and compare
o If they differ, a change has occurred
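
The compute, store, recompute and compare cycle is sketched below in Python using SHA-256 from the standard library. The function names are illustrative, the baseline is kept in an ordinary dict, and protecting that baseline from tampering is out of scope here.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """SHA-256 of the file contents, as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def make_baseline(paths):
    """Compute and store digests while the files are known to be clean."""
    return {str(p): file_digest(p) for p in paths}

def changed_files(baseline):
    """Recompute digests and report files that differ or have disappeared."""
    return [
        path for path, digest in baseline.items()
        if not Path(path).exists() or file_digest(path) != digest
    ]
```

A typical use is to call make_baseline once on a clean system and changed_files periodically afterwards. A reported change says only that the file differs, not why: a legitimate update looks the same as an infection.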

Integrity Checkers
3 types of integrity checkers
Offline --- recompute checksums periodically (e.g., once/week)
Self-checking --- modify file to check itself when run
o Essentially, a beneficial “virus”
o For example, virus scanner self-checks
Integrity shell --- OS performs checksum before file executed

Detection: Dynamic Methods
Detection based on running the code
o Observe the “behavior”
Two types of dynamic methods
o Behavior monitor/blockers
o Emulation

Behavior Monitor/Blocker
Monitor program as it runs
Watch for “suspicious” behavior
What is suspicious?
o It’s too far from “normal”
What is normal?
o A statistical measure --- mean, average
How far is too far?
o Depends on variance, standard deviation
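
As a toy version of “too far from normal”, the Python sketch below flags an observation that lies more than k standard deviations from the mean of past observations. The measured quantity (say, file writes per minute), the function name and the default k = 3 are all assumptions made for illustration.

```python
from statistics import mean, stdev

def is_anomalous(history, observed, k=3.0):
    """True if observed is more than k standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)    # sample standard deviation; needs >= 2 values
    return abs(observed - mu) > k * sigma

writes_per_minute = [4, 5, 6, 5, 4, 6, 5]    # made-up "normal" behavior
print(is_anomalous(writes_per_minute, 6))    # within the usual range
print(is_anomalous(writes_per_minute, 40))   # far outside it
```

A smaller k makes the monitor more sensitive and raises the false positive rate. An anomaly is not necessarily a virus.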

Behavior Monitor/Blocker
“Normal” monitored in 3 ways…
1. Actions that are permitted
o White list, positive detection
2. Actions that are not permitted
o Black list, negative detection
3. Some combination of these two
Analogies to immune system
o Distinguish self from non-self

Behavior Monitor/Blocker
“Care must be taken… because anomalous behavior does not automatically imply viral behavior”
o That’s an understatement!
This is the fundamental problem in anomaly detection
o Potential for lots of false positives

Behavior Monitor/Blocker
Look for short “dynamic signatures”
o Like signature detection, but input string generated dynamically
But what to monitor?
Infection-like behavior?
o Open an exe for read/write
o Read code start address from header
o Write start address to header
o Seek to end of exe, append to exe, etc.

Behavior Monitor/Blocker
How to reduce false positives?
o Consider “ownership” --- some apps get more leeway (e.g., browser clearing cache)
How to prevent damage?
o “Dynamic” implies code actually running…
o System undo capability?
How long to monitor?
o Monitoring increases overhead
o Can virus outlast monitor?

Emulation
Execute code, but not for real…
Instead, emulate execution
Emulation can provide all of the info gotten thru code execution
o But much safer
“Execute” code in emulator
o Gather info for static/dynamic signatures or heuristics
o Behavior blocker stuff applies too

Emulation
Emulation and polymorphic detection
o Let virus decrypt itself
o Then use ordinary signature scan
When has decryption occurred?
o Use some heuristics…
o Execution of code that was modified (decrypted) or in such a memory location
o More than N bytes of modified code, etc.
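
The two heuristics can be combined in a small Python sketch that works on an execution trace instead of a real emulator. The trace format (one tuple of program counter and written addresses per emulated instruction), the function name and the default threshold are assumptions made for this illustration.

```python
def decryption_detected(trace, min_modified=16):
    """Return the step at which execution enters modified memory, or None.

    trace: iterable of (pc, written_addresses), one entry per emulated
    instruction (a hypothetical format). Decryption is assumed finished
    when the program counter lands on a modified byte and at least
    min_modified bytes have been written so far.
    """
    modified = set()
    for step, (pc, writes) in enumerate(trace):
        if pc in modified and len(modified) >= min_modified:
            return step               # now running code that was written earlier
        modified.update(writes)
    return None

# Toy trace: a 3-instruction loop at addresses 0..2 "decrypts" 20 bytes
# starting at address 100, then jumps to the decrypted code.
trace = [(i % 3, [100 + i]) for i in range(20)] + [(100, [])]
print(decryption_detected(trace))
```

Once the function reports a step, the emulator would stop there and hand the modified memory to an ordinary signature scan.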

Emulator Anatomy
Emulate by single-stepping thru code?
o Easily detected by viruses (???)
o Danger of virus “escaping” emulator
“A more elaborate emulation mechanism is needed”
o Why?
Conceptually, 5 parts to an emulator
o Next slide please…

Emulator Anatomy
5 parts to new-and-improved emulator
1. CPU emulation --- nothing more to say
2. Memory emulation
3. Hardware and OS emulation
4. Emulation controller
5. Extra analyses

Memory Emulation
This could be difficult…
o 32-bit addressing, so 4G of “memory”
Do we need to emulate all of this?
o No, most apps only use small amount
Keep track of memory that’s modified and where it is located
o Only need to deal with memory that is modified by a specific app/virus
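
One way to avoid allocating 4G is a sparse memory, sketched below in Python: reads fall through to the loaded program image unless the byte has been written, and only written bytes are stored. The SparseMemory class and its methods are illustrative names, and returning 0 for untouched addresses outside the image is an arbitrary choice.

```python
class SparseMemory:
    """Byte-addressable memory that stores only the bytes that were modified."""

    def __init__(self, image=b"", base=0):
        self.image = image      # read-only program image, loaded at address base
        self.base = base
        self.modified = {}      # address -> byte value, written bytes only

    def read(self, addr):
        if addr in self.modified:
            return self.modified[addr]
        offset = addr - self.base
        if 0 <= offset < len(self.image):
            return self.image[offset]
        return 0                # untouched memory outside the image reads as zero

    def write(self, addr, value):
        self.modified[addr] = value & 0xFF

mem = SparseMemory(b"\x90\xc3", base=0x400000)
mem.write(0xFFFFFFF0, 0x41)     # a write near the top of the 32-bit space
print(hex(mem.read(0x400001)), hex(mem.read(0xFFFFFFF0)), len(mem.modified))
```

The modified dict is also the record of what the code changed and where, which is the information the decryption heuristics need.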

Hardware/OS Emulation
Use stripped-down, fake OS, due to…
o Copyright issues
o Size
o Startup time
o Emulator needs additional monitoring
What about OS system calls?
o Return faked/fixed values
o Don’t faithfully emulate some low-level stuff

Emulation Controller
When does emulation stop?
o Can’t expect to run code to completion…
Use heuristics to decide when to stop
o Number of instructions?
o Amount of time?
o Threshold on percent of instructions that modify memory?
o “Stoppers”? E.g., assume virus wouldn’t write output before being malicious

Emulator: Extra Analyses
Post-emulation analysis
For example, look at histogram of instructions
o Does it match typical polymorphic?
o Does it match a metamorphic family?
Other examples of post-emulation analysis???

If at First You Don’t Succeed
Emulation controller may re-invoke emulator for the following reasons
o Rerun with different CPU parameters
o Test interrupt handlers
o Test multiple possible entry points
o Test for self-replication on “goat” files
o Test untaken branches in code
o Test “unused” memory locations

Emulator Optimizations
Improve performance, reduce size and/or complexity
o Use the real file system (with caution)
o “Data” files must be checked for malware, use lots of stoppers
o Cache state --- if match is found to previous (non-virus) run, go to next file
Cache register values, size, stack pointer and contents, number of writes, checksums, etc.

Comparison of Techniques
Recall, the techniques considered…
1. Scanning
2. Static heuristics
3. Integrity check
4. Behavior blocker
5. Emulation

Comparison of Techniques
Scanning
Pros:
o Precise ID of malware
Cons:
o Requires up-to-date signatures
o Cannot detect new/unknown malware

Comparison of Techniques
Static heuristics
Pros:
o Detect known and unknown malware
Cons:
o Detected malware not identified
o False positives

Comparison of Techniques
Integrity check
Pros:
o Can be efficient and fast
o Detect known and unknown malware
Cons:
o Detected after infection & not identified
o Can’t detect in new/modified file
o Heavy burden on users/admins

Comparison of Techniques
Behavior blocker
Pros:
o Known and unknown malware detected
Cons:
o Probably won’t identify malware
o High overhead
o False positives
o Malware runs on system before detected

Comparison of Techniques
Emulation
Pros:
o Known, unknown, polymorphic detection
o Malware executed in “safe” environment
Cons:
o Slow
o Malware might outlast emulator
o Might not provide identification

Detection: Bottom Line
Static analysis is fast
o Good approach when it works
Dynamic analysis can “peel away a layer of obfuscation”
o Dynamic analysis is relatively costly

Verification, Quarantine, Disinfect
What to do after virus detected?
1. Verify that it really is a virus
2. Quarantine infected code
3. Disinfect --- remove infection
These are done rarely, so can be slow and costly in comparison to detection

Verification
After detection comes verification
Why verify?
o Secondary test needed due to short, general signature, or…
o …no signature, due to detection method
Behavior, heuristic, emulation, etc.
o Do not usually provide identification
Writer might try to make virus look like some other virus

Verification
How to verify?
“X-ray” the virus
If encrypted, decrypt it, or frequency analysis might suffice
o Like simple substitution cipher
Extract info/stats, etc.
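
As a simplified stand-in for x-raying, the Python sketch below assumes the virus body is encrypted with a single-byte XOR key and that part of the plaintext is already known. It tries all 256 keys rather than doing frequency analysis, which is a swapped-in shortcut and not the method the slide names. The function name and the sample bytes are made up.

```python
def xray_xor(body, known_plaintext):
    """Try every single-byte XOR key; return (key, offset) where the known
    plaintext shows up in the decrypted body, or None if no key works."""
    for key in range(256):
        plain = bytes(b ^ key for b in body)
        offset = plain.find(known_plaintext)
        if offset != -1:
            return key, offset
    return None

# Made-up sample: a known fragment hidden under XOR key 0x5A.
fragment = b"VIRUS-BODY"
encrypted = bytes(b ^ 0x5A for b in b"junk" + fragment + b"tail")
print(xray_xor(encrypted, fragment))
```

With the key recovered, the decrypted body can be checked against a longer virus-specific signature or a checksum.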

Verification
After x-ray analysis…
o Longer virus-specific signatures
o Checksum all or part of virus
o Call special-purpose verification code
Note that these probably won’t work on (good) metamorphic code

Quarantine
Isolate detected virus from system
o Then ask user if it’s OK to disinfect
o Or do further analysis of virus
How to quarantine virus?
o Copy to a “quarantine” directory?
o Hide it in “invisible” location?
o Encrypt it?

Disinfect
Disinfect == remove infection
Not always possible to return file to its original state
o E.g., file might have been overwritten
Disinfection methods…
Delete the infected file
o Pros and cons?

Disinfect
Disinfection methods…
Restore files from backup
o Pros and cons?
Use virus-specific info
o Info may be found automatically --- compare infected files with uninfected
o E.g., appended virus, changes start address, appends itself to file, etc.
o Like a chosen plaintext attack

Disinfect
Disinfection methods…
Use virus-behavior specific info
o E.g., prepended virus changes header
Save some info about files
o Headers info, for example
o Then changed parts can be restored
o Integrates well with integrity checker
o Restore parts until checksum matches…
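
“Restore parts until checksum matches” might look like the Python sketch below for a toy file format with a fixed 4-byte header. The format, the saved fields and the two repair steps (put the saved header back, then cut the file to its saved length) are assumptions made for this illustration.

```python
import hashlib

HEADER_LEN = 4   # hypothetical fixed-size header

def snapshot(data):
    """Info saved while the file is clean: header, length, and digest."""
    return {
        "header": data[:HEADER_LEN],
        "length": len(data),
        "digest": hashlib.sha256(data).hexdigest(),
    }

def disinfect(infected, saved):
    """Apply repairs one at a time; stop as soon as the digest matches."""
    repairs = [
        lambda d: saved["header"] + d[HEADER_LEN:],   # undo the header change
        lambda d: d[:saved["length"]],                # cut off appended code
    ]
    candidate = infected
    for repair in repairs:
        candidate = repair(candidate)
        if hashlib.sha256(candidate).hexdigest() == saved["digest"]:
            return candidate
    return None   # e.g. an overwriting virus: the original bytes are gone

clean = b"HDR0" + b"program code"
saved = snapshot(clean)
infected = b"HDR9" + clean[HEADER_LEN:] + b"VIRUS"
print(disinfect(infected, saved) == clean)
```

The checksum doubles as proof that the repair worked. When no sequence of repairs reproduces it, the file has to come from backup instead.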

Disinfect
Disinfection methods…
Use the virus to disinfect
o Stealth virus may give original code
Generic disinfection
o Virus may restore code when executed
o Might be dangerous to run virus code…
o …emulation is a better strategy, maybe even disinfect as part of detection

Virus Databases
What to put in a virus database?
o Name of virus?
o Characteristics of virus?
o Signatures?
o Encrypted/hashed signatures?
o Disinfection info?
o Other info?

Virus Databases
How to update database/signatures?
o Push or pull?
o Automatic or manual?
o How often to update?
o How to distribute updates?
o Distribute entire database or deltas?
Also must be able to update AV software

Virus Updates
Update process is a BIG target
o AV’s machines that distribute updates
o Insider attack at AV site
o Trick user into getting “AV” from attacker
o Man-in-the-middle attack on communications between user/AV

Virus Description Languages

AV vendors have specialized virus description languages

2 examples given in the book

Short Subjects
A few quick points…
Anti-stealth techniques
Macro viruses
Compiler optimizations and detection

Anti-Stealth Techniques
Recall, stealth viruses hide presence
Anti-stealth as part of AV?
o Detect and disable stealth --- check that OS calls go to right place
o Bypass usual OS features --- direct calls to BIOS, for example

Macro Virus Detection
Macro viruses tricky to detect
o Macros are in source code
o Easy to change source
o Robust execution when errors occur
So, any changes can create new virus
AV might create a new virus
o E.g., incomplete disinfection
Macro virus can infect other macros

Macro Viruses
One redeeming feature…
They operate in restricted domain
o So easier to determine “normal”
o Reduces number of false positives
Most/all are not parasitic
o More like companion viruses
All the usual detection techniques can be applied

Macro Viruses: Disinfection
Delete all macros in infected document
Delete all associated macros
Delete macro if in doubt (heuristic)
Emulation to find all macros used by infected macro, and delete them
Basic idea?
o Err on side of caution/deletion
Macro viruses not so common today

Compiler Optimization
Compilers use similar techniques as AV
“Optimizing compiler” for detection??
o Constant propagation --- reduces variables
o Dead code (executed, but not needed)
o Polymorphics may have lots of dead code
If used, efficiency could be an issue
o Compilers extensively studied
o Bad cases well-known, so virus writers might take advantage of these
