Skip to content

Integrating External Data with Investigations Using Amazon S3 and Athena

What external data can be used

With Athena it's possible to add and integrate external data of many different formats, including common formats like JSON, CSV, and Parquet, into investigations queries. The basic process for doing this will be fairly similar regardless of format:

  1. Determine the schema for your data
  2. Load the data into S3
  3. Create the table
  4. Join the external data with internal data

Example External data

This example is a sample log from a ssh proxy that allows users to login to servers with their real credentials.

Headers:

session_id, login_date, real_user, host, local_user

File:

10203,"2020-09-21 03:04:05.324",laneb,ip-172-31-76-158,ec2-user
10204,"2020-09-21 15:14:22.324",mikep,ip-172-31-76-158,ec2-user

Load the data into S3

In order for Athena to access the data you must upload it to S3. Many tools have the ability to output logs to S3 but, for this example, because of the small file size simply uploading it is sufficient. Data can be uploaded to S3 as one or more files. Because of this it's a good idea to put them in a folder matching the name of the table you will create in the next step.

Determining the Schema

Based on the login data from the prior example, the data types for this file would be as follows:

session_id int
login_date string
real_user  string
host       string
local_user string

Athena Schemas are sql schemas and are created with a CREATE EXTERNAL TABLE statement. For this table, the create statement would be as follows:

CREATE EXTERNAL TABLE proxy_logs (
    session_id INT,
login_date STRING, 
real_user STRING,
host STRING,
local_user STRING
) 
ROW FORMAT SERDE `org.apache.hadoop.hive.serde2.OpenCSVSerde`
WITH SERDEPROPERTIES (
    "separatorChar" = ",", "quoteChar" = "\" )
LOCATION `s3://[BUCKET_NAME]/proxy_logs`; 

Running this in Athena creates a table proxy_logs. Additionally, it tells Athena that the files containing its data is located in [BUCKET_NAME]/proxy logs/ where [BUCKET_NAME] is replaced with the bucket from the previous step.

Join the external data with internal data

With the table created, the data is now exposed in Athena. You can see it with a select statement:

Here's an example:

SELECT * FROM proxy_logs ORDER BY session_id DESC LIMIT 100

This will get the last 100 login sessions. This also means that joining the new table with investigations data is possible. For example the following query will get all of the users who logged into a host in the hour before an alert:

SELECT *
FROM proxy_logs pl
LEFT JOIN sensors s
ON s.hostname = pl.host
LEFT JOIN alerts a
ON s.sensor_id = a.sensor_id
WHERE a.incident_id = `adde60c8-4c72-4662-8266-5d968cca20e4` AND date_diff(`day`, date_parse(login_date, '%Y-%m-%d %H:%i:%s'), from_unixtime(a.unix_nano_timestamp / 1e9)) BETWEEN -1 AND 0