Apache Tika

Detect file types (mime types) in Java


Introduction

Apache Tika is a useful open source library written in Java for detecting file types. Most people use it to validate files they accept, such as through a web interface. This is useful for security, since any file extension can be assigned to any file. Only through content inspection can you be sure of a file's real type.

What This Tutorial Covers

  1. Installing Apache Tika
  2. Importing & Using Apache Tika

What You Need For This Tutorial

Java 8


Installing Apache Tika

To install Tika, add the following line to your Gradle build:


compile('org.apache.tika:tika-core:1.17')
      

Other install instructions can be found at Apache Tika Getting Started.


How Apache Tika Works

The first thing Tika does is check for magic numbers, which are bytes at the beginning of a file that indicate its file type. For example, .xlsx files always start with the bytes 50 4B (those are hexadecimal numbers, by the way).

A list of magic numbers can be found at: File Signatures
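
To make the idea concrete, here is a minimal sketch of a magic number check written by hand. This is not Tika's code; the class and method names are made up for illustration. Note that 50 4B is the signature shared by all ZIP-based files, so a match only tells you the file is a ZIP container, not that it is specifically an .xlsx file.

```java
import java.util.Arrays;

final class MagicNumberCheck {
    // 50 4B is "PK" in ASCII, the signature that ZIP-based files (including .xlsx) start with.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B};

    /** Returns true if data begins with the given signature bytes. */
    static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false; // too short to contain the signature
        }
        return Arrays.equals(Arrays.copyOf(data, signature.length), signature);
    }

    /** Hypothetical helper: does this data start like a ZIP container? */
    static boolean looksLikeZip(byte[] data) {
        return startsWith(data, ZIP_MAGIC);
    }

    public static void main(String[] args) {
        byte[] sample = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(looksLikeZip(sample)); // prints true
    }
}
```

A real detector like Tika carries a large table of such signatures, which is why you would normally use the library instead of writing checks like this yourself.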

Some files don't have magic numbers though. In that case, Tika will attempt to determine if the file is a text file by trying to determine its encoding. For example, UTF-8 encoded text files have to follow a certain encoding pattern. A quick explanation is below:


# The first byte of a UTF-8 character indicates how many bytes the character has (one to four).
# Every following byte (a continuation byte) starts with the bits 10:
0xxx xxxx
110x xxxx 10xx xxxx
1110 xxxx 10xx xxxx 10xx xxxx
1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx

# The x bits are filled in with the character's Unicode code point, which is just a binary number. This way we can save space using a variable width encoding (if a code point fits in the seven x bits of the first pattern, we can just use one byte). All Unicode code points fit in at most 4 bytes.
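
The patterns above can be sketched in Java. This is only an illustration, not how Tika implements its text detection; the class and method names are hypothetical. The first method reads the leading byte pattern, and the second leans on the JDK's own UTF-8 decoder, configured to report errors instead of silently replacing bad bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /**
     * Returns how many bytes the leading-byte pattern announces (1 to 4),
     * or -1 if the byte cannot start a character. This checks the bit
     * pattern only; it does not validate the rest of the sequence.
     */
    static int sequenceLength(byte leadByte) {
        int b = leadByte & 0xFF;            // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;   // 0xxx xxxx
        if ((b & 0xE0) == 0xC0) return 2;   // 110x xxxx
        if ((b & 0xF0) == 0xE0) return 3;   // 1110 xxxx
        if ((b & 0xF8) == 0xF0) return 4;   // 1111 0xxx
        return -1;                          // 10xx xxxx (continuation) or invalid
    }

    /** Returns true if the whole array decodes as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // some byte broke the UTF-8 pattern
        }
    }
}
```

For example, the two bytes C3 A9 (the letter é) pass isValidUtf8, while C3 followed by 28 does not, because 28 does not start with the bits 10.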
        

If the file doesn't match an encoding, then Tika will just label it as arbitrary binary data. This means one of 3 things:

  1. The file is indeed just arbitrary binary
  2. The file is text using an unidentifiable encoding, like Windows-1252 (more on this below)
  3. The file is simply unknown

Windows-1252 uses one byte to represent a character, and nearly all of the 256 possible byte values are mapped to a specific character (only a handful are left undefined). That means almost any binary data is technically valid Windows-1252. Therefore, you can't reliably detect when something is in Windows-1252, other than to print it and see if it prints out words or just nonsense (future machine learning project?). This can be pretty frustrating, since Windows-1252 is often the encoding you get when exporting text such as CSV from Excel on Windows, which is a very popular source of data. If you grab data from an Excel spreadsheet, make sure you encode it into something like UTF-8 first. Otherwise, you might have trouble identifying it later.
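
If you already know that some bytes are Windows-1252, re-encoding them as UTF-8 takes one line in Java. The sketch below assumes the JDK ships the "windows-1252" charset (mainstream JDKs do, but it is not one of the charsets Java guarantees); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Reencode {
    // Assumption: this charset is available on the running JDK.
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    /** Decodes the input as Windows-1252 and re-encodes the text as UTF-8. */
    static byte[] windows1252ToUtf8(byte[] input) {
        String text = new String(input, WINDOWS_1252);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, the single Windows-1252 byte E9 (é) becomes the two UTF-8 bytes C3 A9, which follow the UTF-8 pattern and can be recognized as text later.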


Importing & Using Apache Tika

Tika's detect function will return a string of the detected mime type. For example, arbitrary binary data is "application/octet-stream" and a plain text file is "text/plain". For a complete list of mime types and their associated file extensions, check out this file: MimeTypes

Using Tika in your code is pretty simple: import the Tika class, create a Tika object, and call its detect function on the file's bytes.
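
Here is a minimal sketch of what that can look like, assuming tika-core is on the classpath as set up in the install step. The wrapper class and method names are made up for illustration; the Tika class and its detect overloads for a byte array and a File come from the library.

```java
import org.apache.tika.Tika;

import java.io.File;
import java.io.IOException;

final class TikaDetectExample {
    private static final Tika TIKA = new Tika();

    /** Detects the mime type from the raw bytes, e.g. the contents of an upload. */
    static String detectType(byte[] data) {
        return TIKA.detect(data);
    }

    /** Detects the mime type of a file on disk. */
    static String detectType(File file) throws IOException {
        return TIKA.detect(file);
    }

    public static void main(String[] args) throws IOException {
        // Pass a path on the command line to print that file's detected type.
        System.out.println(detectType(new File(args[0])));
    }
}
```

To validate an upload, you would compare the returned string against the mime types you are willing to accept and reject everything else.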

Done!

That's it. A lot of text just to explain this, haha: (new Tika()).detect(file.getBytes())