Tag: bulk_extractor

Announcing bulk_extractor 1.2.

bulk_extractor Version 1.2 has been released for Linux, MacOS and Windows.

Key features of Version 1.2 include:

  • Dramatically improved performance of the AES and IP packet scanning modules. (scan_aes runs in 15% of the time of the original implementation.) As a result, scan_aes and scan_net are now enabled by default.
  • The stop-list and context-sensitive stop-list processing has been rewritten:
    • Feature files can now be used as context-sensitive stop lists.
    • Feature files with different-sized context windows can be freely intermixed as stop lists.
    • The program make_context_stop_list.py is no longer used.
    • Stop-list files that are not feature files may contain literals or regular expressions.
  • In practice, this means that the -s option has been removed. You can use -w with a text file that is a list of words, a list of regular expressions, or a feature file. If it is a feature file, it should just work as a context-sensitive stop list. It turns out that it was easier to write it this way than to have different switches for the different kinds of stop lists and then to throw error messages when the wrong kind of list was given to the wrong option.
  • The find (“-f”) option now searches for regular expressions, not globs.
  • Dramatically improved defenses against compression bombs. Now bulk_extractor detects that it is decompressing a compression bomb and goes into a “safe decompress” mode in which new compressed regions are not decompressed if they have an MD5 that matches other compressed regions that have been decompressed. A notation is written into the zip.txt feature file that a compression bomb was encountered.
  • scan_net now carves both IPv4 and IPv6 packets. As in Version 1.1, the resulting packets are put into PCAP files.
  • A new -G option allows the page size to be specified.
  • The pre-compiled Windows binary now runs faster than the Linux binary, although this is because it is not running scan_exif.
  • Wordlist deduplication is significantly faster.
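
The “safe decompress” idea from the compression-bomb item above can be illustrated with a short Python sketch. This is not bulk_extractor's own code: the class name SafeDecompressor, the bomb_ratio trigger, and the notes list are all hypothetical, and the release notes do not say how a bomb is actually detected. The sketch only shows the stated behavior: once a bomb is suspected, a compressed region whose MD5 matches a region already decompressed is skipped.

```python
import hashlib
import zlib


class SafeDecompressor:
    """Skips repeated compressed regions once a bomb is suspected."""

    def __init__(self, bomb_ratio=100):
        # bomb_ratio is a hypothetical trigger: an output/input size
        # ratio above which we assume a compression bomb.
        self.bomb_ratio = bomb_ratio
        self.seen_md5 = set()   # MD5s of regions already decompressed
        self.safe_mode = False
        self.notes = []         # stands in for the notation in zip.txt

    def decompress(self, region: bytes):
        digest = hashlib.md5(region).hexdigest()
        if self.safe_mode and digest in self.seen_md5:
            return None         # duplicate region: do not decompress again
        self.seen_md5.add(digest)
        data = zlib.decompress(region)
        if not self.safe_mode and len(data) > self.bomb_ratio * len(region):
            self.safe_mode = True
            self.notes.append("compression bomb encountered")
        return data
```

Hashing the compressed bytes rather than the output keeps the check cheap: a repeated region is rejected before any decompression work is done.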
PERFORMANCE STATISTICS
Disk image: /corp/drives/nps/nps-2009-ubnist1/ubnist1.gen3.E01
            /corp/drives/nps/nps-2009-ubnist1/ubnist1.gen3.E02
            Media size:         1.9 GiB (2106589184 bytes)
            MD5:                49a775d8b109a469d9dd01dc92e0db9c
Hardware:   MacBook Pro 2 GHz Intel Core i7, 8GB 1333 MHz DDR3
OS:         MacOS 10.7.3
Compiler:   i686-apple-darwin11-llvm-g++-4.2 (GCC) 4.2.1
            (Based on Apple Inc. build 5658)
            (LLVM build 2336.1.00) 

bulk_extractor version 1.1.3:   468.6 seconds (4.28 MBytes/sec)
bulk_extractor version 1.2.0:   350.7 seconds (5.72 MBytes/sec)
Windows 7, same platform, scan_exif disabled:
bulk_extractor.exe 1.2.0:       207.4 seconds (9.69 MBytes/sec) 
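
The throughput figures above can be re-derived from the media size and the elapsed times. The short check below assumes that “MBytes” means 2**20 bytes, which is the reading that matches the reported numbers to within about 0.01; the function name is only illustrative.

```python
MEDIA_BYTES = 2106589184  # media size reported above


def mbytes_per_sec(seconds: float) -> float:
    """Throughput in units of 2**20 bytes per second."""
    return MEDIA_BYTES / seconds / 2**20


runs = [("1.1.3", 468.6), ("1.2.0", 350.7), ("1.2.0 (Windows)", 207.4)]
for label, seconds in runs:
    print(f"{label}: {mbytes_per_sec(seconds):.2f} MBytes/sec")
```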

Current list of bulk_extractor scanners:
scan_accts   - Looks for phone numbers, credit card numbers, etc
scan_base64  - decodes BASE64 text
scan_kml     - Detects KML files
scan_gps     - Detects XML from Garmin GPS devices
scan_aes     - Detects in-memory AES keys from their key schedules
scan_json    - Detects JavaScript Object Notation files
scan_exif    - Detects EXIF structures from JPEGs
scan_zip     - Detects and decompresses ZIP files and zlib streams
scan_gzip    - Detects and decompresses GZIP files and gzip streams
scan_pdf     - Extracts text from some kinds of PDF files
scan_hiber   - Detects and decompresses Windows hibernation
               file fragments
scan_winprefetch
             - Detects and extracts fields from Windows
               prefetch files and file fragments.
Current list of bulk_extractor feature files:
aes_keys.txt - AES encryption keys
alerts.txt   - Features found on alert list (redlist)
ccn.txt      - credit card numbers
ccn_track2.txt - Track 2 information
domain.txt   - All extracted domain names and IP addresses
email.txt    - extracted email addresses
ether.txt    - extracted ethernet addresses. Currently
               overcollecting due to a failure to consider
               local context.
exif.txt     - All exif fields from JPEGs; extracted as XML.
find.txt     - Hits on find command.
gps.txt      - Extracted GPS coordinates from Garmin XML and
               GPS-enabled JPEG files
ip.txt       - Extracted IP addresses from scan_net
               cksum-bad indicates checksum test failed;
               those are less likely to actually be IP
               addresses.
json.txt     - Extracted and validated JavaScript Object
               Notation fragments.
kml.txt      - Extracted KML files
report.xml   - The DFXML file that explains what happened.
rfc822.txt   - All extracted RFC822 headers
tcp.txt      - Summaries of all extracted UDP and TCP packets.
telephone.txt- Extracted phone numbers
url.txt      - Extracted URLs
  url_facebook-id - extracted Facebook IDs
  url_microsoft-live - extracted Microsoft Live IDs
  url_searches       - extracted search terms
  url_services       - extracted services from URLs
winprefetch.txt - Windows prefetch files and fragments,
                  recoded as XML for easy processing.
wordlist.txt - All the words
zip.txt      - Information about all ZIP files and zip
               components.

Feature List for 1.3:

We are considering the following features for 1.3:

  • Putting a BOM at the beginning of all feature files and forcing the coding of the features to UTF-8 (The context will still be reported as ASCII with octal escaping of values outside the printable range.)
  • Replacing FTS with a new implementation for searching files.
  • Replacing exiv2 with our own EXIF processor.
  • Automatically detecting and reporting Windows shortcut files and IE history.
  • Scanning for the start of BitLocker-protected volumes.
  • Support for checkpointing using BLCR.
  • Improved restarting, so that each page is retried once but only once. (Frankly, the improved reliability in version 1.2 made this request less important.)
  • Support on distributed computing arrays.

We are also considering the following scanners (and need help!):

  • LZMA decompression
  • RAR & RAR2 decompression
  • BZIP2 decompression
  • MSI decompression
  • CAB decompression
  • NTFS decompression
  • VCARD detection
  • PE Header Detection
  • Better handling of MIME encoding
  • SQLite database identification
  • Processing of physical drives
  • Scanning for MD5 hash codes
  • Scanning for word lists
  • Python bridge, so scanners can be written in python

As always, bulk_extractor can be downloaded from http://afflib.org/

February 16, 2012

Producing an EXIF csv

This page shows how to use bulk_extractor’s post_process_exif.py script to produce a CSV file, loadable into Excel, that contains all of the EXIF data found on a disk image.

Start by running bulk_extractor on a disk image:

$ bulk_extractor -o exifdemo /corp/drives/nps/nps-2009-ubnist1/ubnist1.gen3.raw
0: Phase 1.
0: Input file: /corp/drives/nps/nps-2009-ubnist1/ubnist1.gen3.raw
0: Output directory: exifdemo
...

When the run finishes, the output directory will look like this:

$ ls -l exifdemo
total 38972
-rw-r--r--  1 simsong  staff        64 Nov  7 08:09 _thread0.stat
-rw-r--r--  1 simsong  staff        64 Nov  7 08:09 _thread1.stat
-rw-r--r--  1 simsong  staff        64 Nov  7 08:09 _thread2.stat
-rw-r--r--  1 simsong  staff        64 Nov  7 08:12 _thread3.stat
-rw-r--r--  1 simsong  staff       176 Nov  7 08:12 ccn.txt
-rw-r--r--  1 simsong  staff       128 Nov  7 08:06 config.cfg
-rw-r--r--  1 simsong  staff  13876849 Nov  7 08:12 domain.txt
-rw-r--r--  1 simsong  staff  17500510 Nov  7 08:12 email.txt
-rw-r--r--  1 simsong  staff    202768 Nov  7 08:12 exif.txt
-rw-r--r--  1 simsong  staff         0 Nov  7 08:12 exif_stopped.txt
-rw-r--r--  1 simsong  staff       285 Nov  7 08:12 report.txt
-rw-r--r--  1 simsong  staff   3101978 Nov  7 08:12 rfc822.txt
-rw-r--r--  1 simsong  staff     27651 Nov  7 08:12 telephone.txt
-rw-r--r--  1 simsong  staff   4711492 Nov  7 08:12 url.txt
-rw-r--r--  1 simsong  staff      1771 Nov  7 08:12 url_searches.txt
-rw-r--r--  1 simsong  staff    131051 Nov  7 08:12 url_services.txt
-rw-r--r--  1 simsong  staff    265961 Nov  7 08:12 zip.txt
$

Now run the script post_process_exif.py, which is part of the bulk_extractor release (be sure that you have the 0.5.7 release or above):

$ python ~/domex/src/bulk_extractor/post_process_exif.py  exif.txt exif.csv
Input file: exif.txt
Output file: exif.csv
Scanning for EXIF tags...
There are 95 exif tags
$

You can now open the resulting exif.csv file in Excel.

November 13, 2010

bulk_extractor 0.5.4 released

Version 0.5.4 of bulk_extractor has been released. This version includes “crash protection” (you can have it catch a signal if you want), full support for BASE64 decoding, ZIP, GZIP, and even CCN Track 2 data!  We also found a memory allocation bug in the processing of raw images. So if you were having problems before, you should upgrade now!

October 27, 2010

bulk_extractor 0.4.2 is released

bulk_extractor 0.4.2 is released.
Significant features include:
  • Support for context-based stop lists
  • Automatic carving of PKZIP files
  • Improved support for EXIF carving

Context-based stop list

Many users of bulk_extractor report surprise at the large number of email addresses, URLs, JPEGs, and other information contained within the standard Microsoft Windows and Linux distributions. For example, Microsoft Windows XP SP3 contains 306 distinct email addresses, including not just addresses like piracy@microsoft.com and info@valicert.com, but also email addresses that look like they belong to individuals, such as mojemeno@msn.com and mittnavn@msn.com.
We initially attempted to resolve this issue by creating a “stop list” of the distribution email addresses and building that stop list into the bulk_extractor binary. The problem with this approach, we quickly learned, is that these problematic email addresses might appear in a variety of contexts, but we only want them suppressed when they are harvested as part of the operating system files. For example, we don’t want to be alerted to the mojemeno@msn.com email address when it appears as part of Microsoft Windows, but we do want this email address reported if it is found elsewhere.
To resolve this problem, bulk_extractor now supports a context-based stop list. Instead of simply listing email addresses that should be suppressed, the context-based stop list contains both the email address and the context in which that email address occurs. Here we define “context” to mean the 8 characters before the email address and the 8 characters following the email address in the disk image.
The context-based stop list is distributed as a specially formatted text file in which each line contains the element to be suppressed, a tab, and the element in context. Unprintable characters are reported as underbars. For example, these two entries suppress the two occurrences of the mojemeno@msn.com email address in Windows XP SP3:
mojemeno@msn.com    ail.com_mojemeno@msn.com_priklad
mojemeno@msn.com    il.com__mojemeno@msn.com__prikla
All items suppressed by the traditional regular-expression stop list or the context-based stop list are now presented in separate feature files (for example, email_stop.txt). In no case is information actually discarded. Presenting the suppressed results is important for tool validation, both in testing and when the tool is actually run. Stopped terms may also be useful for building a profile of the hard drive.
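
The matching rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the bulk_extractor implementation: the function names are hypothetical, and the definition of “unprintable” (anything outside ASCII 0x20 to 0x7E) is a guess. A feature is stopped only when both the feature and its 8-character context on each side match a stop-list entry.

```python
CONTEXT_WINDOW = 8  # characters kept on each side, per the definition above


def printable(data: bytes) -> str:
    """Render bytes as text, replacing unprintable bytes with underbars.

    Treating 0x20..0x7E as printable is an assumption of this sketch.
    """
    return "".join(chr(b) if 0x20 <= b < 0x7F else "_" for b in data)


def context_of(image: bytes, start: int, end: int) -> str:
    """Return the feature at image[start:end] with its surrounding context."""
    lo = max(0, start - CONTEXT_WINDOW)
    return printable(image[lo:end + CONTEXT_WINDOW])


def load_stop_list(lines):
    """Parse 'feature<TAB>context' lines into a set of (feature, context)."""
    stops = set()
    for line in lines:
        feature, _, context = line.rstrip("\n").partition("\t")
        if context:
            stops.add((feature, context))
    return stops


def is_stopped(image: bytes, start: int, end: int, stops) -> bool:
    """True only if the feature appears in a context listed in stops."""
    feature = image[start:end].decode("latin-1")
    return (feature, context_of(image, start, end)) in stops
```

Because the context is part of the key, the same address found in a user's own mail would not match and would still be reported.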
bulk_extractor now comes with a Python program called make_context_stop_list.py. This program will process the output of bulk_extractor from multiple runs and create a single context-based stop list. We are also distributing a sample context-based stop list, which is derived from the following operating systems:
  • fedora12-64
  • redhat54-ent-64
  • w2k3-32bit
  • w2k3-64bit
  • win2008-r2-64
  • win7-ent-32
  • win7-utl-64
  • winXP-32bit-sp3
  • winXP-64bit
You can download version 1.0 of the stop list from: http://afflib.org/downloads/feature_context.1.0.zip Be sure to decompress the list first! We are distributing it in ZIP form because it is 70 megabytes in length. A future version of bulk_extractor may read the compressed list directly.
Context-based stop lists correct the stop-list problem that surfaced with bulk_extractor 0.4.0. That version simply suppressed terms that were already present in the Windows and Linux distributions. Unfortunately this created an attack vector in which attackers could register and use these email addresses and in so doing escape detection.

PKZIP Carving

Version 0.4.2 introduces carving of PKZIP components. Whenever bulk_extractor finds a component of a ZIP file that includes a valid header, it attempts to decompress the fragment and then recursively reprocesses the decompressed data with all of the extractors. Currently the results of ZIP carving are reported with standard offsets. In the future, the offsets will be reported as NNNNNN-ZIP, where NNNNNN is the byte offset of the ZIP component.
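
The carving step can be sketched as follows. This is a minimal Python illustration, not the scanner's actual code: carve_zip_components is a hypothetical name, and the sketch handles only deflate-compressed components. It relies on the standard 30-byte ZIP local file header, which begins with the signature PK\x03\x04.

```python
import struct
import zlib

# Layout of the 30-byte ZIP local file header (little-endian).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
SIGNATURE = b"PK\x03\x04"


def carve_zip_components(buf: bytes):
    """Yield (offset, name, data) for each deflated component found in buf."""
    pos = buf.find(SIGNATURE)
    while pos != -1:
        header = buf[pos:pos + LOCAL_HEADER.size]
        if len(header) == LOCAL_HEADER.size:
            fields = LOCAL_HEADER.unpack(header)
            method, name_len, extra_len = fields[3], fields[9], fields[10]
            name_start = pos + LOCAL_HEADER.size
            data_start = name_start + name_len + extra_len
            if method == 8:  # 8 = deflate
                try:
                    # wbits=-15 selects a raw deflate stream (no zlib header)
                    data = zlib.decompressobj(-15).decompress(buf[data_start:])
                    name = buf[name_start:name_start + name_len].decode("latin-1")
                    yield pos, name, data
                except zlib.error:
                    pass  # corrupt fragment: skip it
        pos = buf.find(SIGNATURE, pos + 1)
```

In a recursive design, each yielded data buffer would be handed back to every scanner, including this one, so that ZIP files nested inside ZIP files are also found. A depth limit is advisable in that case.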

Improved support for EXIF Carving

Version 0.4.2 finds and carves EXIF headers of JPEG files. All of the results are stored in a feature file that consists of the MD5 hash of the first 4K of the JPEG and an XML structure. bulk_extractor now also comes with a program called post_process_exif.py, which reads this file and creates a tab-delimited file that can be imported into Microsoft Excel, with each EXIF field in its own spreadsheet column.
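
To show the kind of transformation involved, here is a minimal Python sketch. It is not post_process_exif.py itself. It assumes each feature line has the form offset<TAB>MD5<TAB>XML, and that the XML is a single root element whose children are named after EXIF tags; both are assumptions of this example, as are the function names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def exif_features_to_rows(lines):
    """Return (sorted tag names, one dict per feature line)."""
    rows, tags = [], set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                  # skip comments and blank lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue                  # not in the assumed three-field form
        offset, md5, xml_text = parts[0], parts[1], parts[2]
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            continue                  # not well-formed XML: skip
        row = {"offset": offset, "md5": md5}
        for child in root:
            row[child.tag] = (child.text or "").strip()
            tags.add(child.tag)
        rows.append(row)
    return sorted(tags), rows


def write_table(tags, rows, out):
    """Write one tab-delimited row per image, one column per EXIF tag."""
    fields = ["offset", "md5"] + tags
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```

Collecting the full set of tags first is what lets every row share the same columns; images that lack a tag simply get an empty cell.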

September 27, 2010

bulk_extractor 0.3.3 is released

Minor bugfixes for WIN32.

June 2, 2010

bulk_extractor 0.3.2 released

bulk_extractor 0.3.2 is released. This version has several new
features based on user feedback, and a few bug fixes based on a
thorough code review.

New Features:

  • url_services.txt – a histogram of all URLs by domain.

  • url_searches.txt – a histogram of all search terms, including Google, Yahoo, Bing, and any other search service with “search” in the domain and “q=” or “p=” in the URL.
  • ccn.txt – this file now reports Federal Express account numbers, SSNs (if properly formatted or prefixed), DOBs, and other info.
  • tcp.txt – This experimental feature looks for IP and TCP packets in PAGEFILE.SYS, memory dumps and hibernation files, and stores the results.
  • The whitelist and redlist files may now contain globbed terms. For example, put *@company.com in the redlist and any mention of anyone@company.com will be flagged and also put into a special file called redlist_found.txt.
  • CONTEXT: ccn.txt now shows the context from which the matched
    information was taken. hosts.txt shows context for numeric IP addresses.
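
The url_searches.txt histogram described in the list above can be approximated with a short Python sketch. This is an illustration, not the code bulk_extractor uses: the host hints and function names are assumptions, and the only rule taken from the description is that a search URL has a recognizable search host and a “q=” or “p=” parameter.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical host hints: the named services plus any "search" domain.
SEARCH_HOST_HINTS = ("google", "yahoo", "bing", "search")


def search_terms(url: str):
    """Return the search terms carried in a URL's q= or p= parameters."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not any(hint in host for hint in SEARCH_HOST_HINTS):
        return []
    query = parse_qs(parts.query)
    return query.get("q", []) + query.get("p", [])


def search_histogram(urls):
    """Count how often each search term occurs across all URLs."""
    return Counter(term for url in urls for term in search_terms(url))
```

Calling search_histogram(urls).most_common() on the URLs from url.txt would list the terms from most to least frequent.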

Bug Fixes:

  • Improved handling of raw devices and files.

  • bulk_extractor is now less likely to error on some input data sets.
  • A crashing bug that impacted bulk_extractor 0.3.1 has been addressed.

May 25, 2010

bulk_extractor 0.2.1 released

I am pleased to announce the release of bulk_extractor 0.2.1.  This version corrects a few minor bugs in version 0.1.0 and is available immediately.  We have also increased the version number from 0.1.x to 0.2.x to reflect a total rewrite in the way that the underlying flex architecture is implemented.

April 25, 2010

