This project allows profiles to be executed using Apache Spark. This is a port of the Profiler to Spark that allows you to backfill profiles using archived telemetry.
Using the Streaming Profiler in Apache Storm allows you to create profiles based on the stream of telemetry being captured, enriched, triaged, and indexed by Metron. This does not allow you to create a profile based on telemetry that was captured in the past.
There are many cases where you might want to produce a profile from telemetry in the past. This is referred to as profile seeding or backfilling.
As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile so that I can determine if the profile has predictive value for model building.
As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.
The Batch Profiler running in Apache Spark allows you to seed a profile using archived telemetry.
The portion of a profile produced by the Batch Profiler should be indistinguishable from the portion created by the Streaming Profiler. Consumers of the profile should not care how the profile was generated. Using the Streaming Profiler together with the Batch Profiler allows you to create a complete profile over a wide range of time.
For an introduction to the Profiler, see the Profiler README.
Create a profile definition by editing $METRON_HOME/config/zookeeper/profiler.json as follows.
cat $METRON_HOME/config/zookeeper/profiler.json
{
  "profiles": [
    {
      "profile": "hello-world",
      "foreach": "'global'",
      "init": { "count": "0" },
      "update": { "count": "count + 1" },
      "result": "count"
    }
  ],
  "timestampField": "timestamp"
}
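If you would rather create the file non-interactively, a here-document is one option. The sketch below writes the same hello-world definition; the PROFILES_FILE variable and its temporary location are assumptions made for illustration, so point it at $METRON_HOME/config/zookeeper/profiler.json on a real installation.

```shell
# Hypothetical target path; on a real install this would be
# $METRON_HOME/config/zookeeper/profiler.json.
PROFILES_FILE="$(mktemp)"

# The quoted 'EOF' stops the shell from expanding anything inside the JSON.
cat > "$PROFILES_FILE" <<'EOF'
{
  "profiles": [
    {
      "profile": "hello-world",
      "foreach": "'global'",
      "init": { "count": "0" },
      "update": { "count": "count + 1" },
      "result": "count"
    }
  ],
  "timestampField": "timestamp"
}
EOF

# Quick sanity check that the definition was written.
grep -c '"profile": "hello-world"' "$PROFILES_FILE"
```

Quoting the here-document delimiter matters here because the profile expressions contain quote characters that an unquoted here-document could otherwise mangle.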
Ensure that you have archived telemetry available for the Batch Profiler to consume. By default, Metron will store this in HDFS at /apps/metron/indexing/indexed/*/*.
hdfs dfs -cat /apps/metron/indexing/indexed/*/* | wc -l
Copy the hbase-site.xml file from /etc/hbase/conf to /etc/spark2/conf. Creating a symlink rather than a copy is recommended: it avoids duplicating the file and keeps both locations consistent when the HBase configuration changes.
ln -s /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/hbase-site.xml
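The plain ln -s command fails if the destination already exists. A re-runnable variant might look like the following sketch, which uses temporary directories as stand-ins for /etc/hbase/conf and /etc/spark2/conf; the variable and function names are hypothetical.

```shell
# Illustrative stand-ins for /etc/hbase/conf and /etc/spark2/conf.
HBASE_CONF_DIR="$(mktemp -d)"
SPARK_CONF_DIR="$(mktemp -d)"

# Placeholder file so the sketch runs without an HBase installation.
echo '<configuration/>' > "$HBASE_CONF_DIR/hbase-site.xml"

link_hbase_site() {
  # -s: symbolic link
  # -f: replace an existing file or link at the destination
  # -n: do not follow an existing destination link
  ln -sfn "$HBASE_CONF_DIR/hbase-site.xml" "$SPARK_CONF_DIR/hbase-site.xml"
}

link_hbase_site
link_hbase_site   # running it a second time is harmless
```

Note that -f silently replaces a regular file at the destination, so check for a hand-edited copy before using this form.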
Review the Batch Profiler’s properties located at $METRON_HOME/config/batch-profiler.properties. See Configuring the Profiler for more information on these properties.
You may want to edit the log4j properties file in the configuration directory under ${SPARK_HOME}, or create one if it does not exist. It may be helpful to turn on DEBUG logging for the Profiler by adding the following line.
log4j.logger.org.apache.metron.profiler.spark=DEBUG
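One way to add that line without duplicating it on repeated runs is sketched below. LOG4J_FILE is a hypothetical variable pointing at a scratch file; on a real installation it would be the log4j properties file in your Spark configuration directory.

```shell
# Hypothetical location; a scratch file is used so the sketch is safe to run.
LOG4J_FILE="$(mktemp)"
LOGGER_LINE='log4j.logger.org.apache.metron.profiler.spark=DEBUG'

enable_profiler_debug() {
  # -x matches whole lines, -F treats the pattern as a fixed string,
  # -q suppresses output. The line is appended only when it is missing.
  grep -qxF "$LOGGER_LINE" "$LOG4J_FILE" || echo "$LOGGER_LINE" >> "$LOG4J_FILE"
}

enable_profiler_debug
enable_profiler_debug   # the second call leaves the file unchanged
```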
Run the Batch Profiler.
source /etc/default/metron
cd $METRON_HOME
$METRON_HOME/bin/start_batch_profiler.sh
Query for the profile data using the Profiler Client.
The Batch Profiler package is installed automatically when installing Metron using the Ambari MPack. See the following notes when installing the Batch Profiler without the Ambari MPack.
A script located at $METRON_HOME/bin/start_batch_profiler.sh has been provided to simplify running the Batch Profiler. This script makes the following assumptions.
The script builds the profiles defined in $METRON_HOME/config/zookeeper/profiler.json.
The properties defined in $METRON_HOME/config/batch-profiler.properties are passed to both the Profiler and Spark. You can define both Spark and Profiler properties in this same file.
The script assumes that Spark is installed at /usr/hdp/current/spark2-client. This can be overridden if you define an environment variable called SPARK_HOME prior to executing the script.
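The override behavior described above matches a common shell idiom, default parameter expansion. The sketch below illustrates that idiom only; it is not taken from start_batch_profiler.sh, and resolve_spark_home is a hypothetical helper name.

```shell
resolve_spark_home() {
  # Use the caller's SPARK_HOME when it is set and non-empty,
  # otherwise fall back to the documented default location.
  echo "${SPARK_HOME:-/usr/hdp/current/spark2-client}"
}

echo "Using Spark at $(resolve_spark_home)"
```

To point the script at a different installation, export the variable before launching it, for example SPARK_HOME=/opt/spark (a hypothetical path).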
The Batch Profiler may also be started using spark-submit as follows. See the Spark Documentation for more information about spark-submit.
${SPARK_HOME}/bin/spark-submit \
    --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \
    --properties-file ${SPARK_PROPS_FILE} \
    ${METRON_HOME}/lib/metron-profiler-spark-*.jar \
    --config ${PROFILER_PROPS_FILE} \
    --profiles ${PROFILES_FILE}
The Batch Profiler accepts the following arguments when run from the command line as shown above. All arguments following the Profiler jar are passed to the Profiler. All arguments preceding the Profiler jar are passed to Spark.
Argument | Description |
---|---|
-p, --profiles | Path to the profile definitions. |
-c, --config | Path to the profiler properties file. |
-g, --globals | Path to the Stellar global config file. |
-r, --reader | Path to properties for the DataFrameReader. |
-h, --help | Print the help text. |
The path to a file containing key-value properties for the Profiler. This file would contain the properties described under Configuring the Profiler.
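The argument ordering can be easier to see in a dry-run helper that only prints the command it would run. In this sketch, GLOBALS_FILE, READER_PROPS_FILE, PROFILER_JAR, and the /path/to/metron fallback are hypothetical names introduced for illustration; the remaining variables follow the spark-submit example.

```shell
# Illustrative defaults; adjust for your installation.
SPARK_HOME="${SPARK_HOME:-/usr/hdp/current/spark2-client}"
METRON_HOME="${METRON_HOME:-/path/to/metron}"
PROPS="${METRON_HOME}/config/batch-profiler.properties"
SPARK_PROPS_FILE="${SPARK_PROPS_FILE:-$PROPS}"
PROFILER_PROPS_FILE="${PROFILER_PROPS_FILE:-$PROPS}"
PROFILES_FILE="${PROFILES_FILE:-${METRON_HOME}/config/zookeeper/profiler.json}"
PROFILER_JAR="${PROFILER_JAR:-${METRON_HOME}/lib/metron-profiler-spark-*.jar}"

batch_profiler_cmd() {
  # Arguments before the jar go to Spark.
  cmd="${SPARK_HOME}/bin/spark-submit"
  cmd="$cmd --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI"
  cmd="$cmd --properties-file ${SPARK_PROPS_FILE}"
  cmd="$cmd ${PROFILER_JAR}"
  # Arguments after the jar go to the Profiler.
  cmd="$cmd --config ${PROFILER_PROPS_FILE}"
  cmd="$cmd --profiles ${PROFILES_FILE}"
  # Optional arguments are appended only when the caller has set them.
  if [ -n "${GLOBALS_FILE:-}" ]; then cmd="$cmd --globals ${GLOBALS_FILE}"; fi
  if [ -n "${READER_PROPS_FILE:-}" ]; then cmd="$cmd --reader ${READER_PROPS_FILE}"; fi
  # Print instead of executing, so the command can be reviewed first.
  echo "$cmd"
}

batch_profiler_cmd
```

Because the command is assembled as a single string, this sketch does not handle paths containing spaces.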
Spark supports a number of different cluster managers. The underlying cluster manager is transparent to the Profiler. To run the Profiler on a particular cluster manager, it is just a matter of setting the appropriate options as defined in the Spark documentation.
By default, the Batch Profiler instructs Spark to run in local mode. This will run all of the Spark execution components within a single JVM. This mode is only useful for testing with a limited set of data.
$METRON_HOME/config/batch-profiler.properties
spark.master=local
To run the Profiler using Spark on YARN, at a minimum edit the value of spark.master as shown. In many cases it also makes sense to set the YARN deploy mode to cluster.
$METRON_HOME/config/batch-profiler.properties
spark.master=yarn
spark.submit.deployMode=cluster
See the Spark documentation for information on how to further control the execution of Spark on YARN. Any of these properties can be added to the Profiler properties file.
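Switching an existing properties file from local mode to YARN can be scripted. The following is a minimal sketch that works on a scratch copy; set_prop is a hypothetical helper, and the spark.executor.memory value is illustrative rather than a recommendation.

```shell
# Scratch copy; on a real install the file would be
# $METRON_HOME/config/batch-profiler.properties.
PROPS_FILE="$(mktemp)"
printf 'spark.master=local\nprofiler.period.duration=15\n' > "$PROPS_FILE"

set_prop() {
  # Usage: set_prop FILE KEY VALUE
  # Drops any existing line whose key equals KEY, then appends KEY=VALUE.
  tmp="$(mktemp)"
  awk -F= -v k="$2" '$1 != k' "$1" > "$tmp"
  echo "$2=$3" >> "$tmp"
  mv "$tmp" "$1"
}

set_prop "$PROPS_FILE" spark.master yarn
set_prop "$PROPS_FILE" spark.submit.deployMode cluster
# Example of a further Spark setting; the value is illustrative only.
set_prop "$PROPS_FILE" spark.executor.memory 2g

cat "$PROPS_FILE"
```

The helper assumes keys are written without spaces around the equals sign, as in the examples in this document.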
The following command can be useful to review the logs generated when the Profiler is executed on YARN.
yarn logs -applicationId <application-id>
See the Spark documentation for information on running the Batch Profiler in a secure, kerberized cluster.
The Profiler can consume archived telemetry stored in a variety of input formats. By default, it is configured to consume the text/json that Metron archives in HDFS. This is often not the best format for archiving telemetry. If you choose a different format, you should be able to configure the Profiler to consume it by doing the following.
$METRON_HOME/config/batch-profiler.properties
profiler.batch.input.format=org.apache.spark.sql.execution.datasources.orc
profiler.batch.input.path=hdfs://localhost:9000/apps/metron/indexing/orc/*/*
The following examples highlight the configuration values needed to read telemetry stored in common formats. These values should be defined in the Profiler properties (see --config).
profiler.batch.input.reader=orc profiler.batch.input.path=/path/to/orc/
profiler.batch.input.reader=parquet profiler.batch.input.path=/path/to/parquet/
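The two settings in each example always travel together, so a small generator can help avoid typos. This sketch assumes only the three reader names that appear in this document (json, orc, parquet); other readers may exist, and reader_props is a hypothetical helper.

```shell
reader_props() {
  # Usage: reader_props READER PATH
  # Lower-case the reader name, since the value is not case sensitive.
  reader="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$reader" in
    json|orc|parquet)
      echo "profiler.batch.input.reader=${reader}"
      echo "profiler.batch.input.path=$2"
      ;;
    *)
      # This sketch only knows the readers named in this document.
      echo "unrecognized reader: $1" >&2
      return 1
      ;;
  esac
}

reader_props ORC /path/to/orc/
```

The output can be appended to the Profiler properties file after removing any earlier values for the same keys.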
By default, the configuration for the Batch Profiler is stored in the local filesystem at $METRON_HOME/config/batch-profiler.properties.
You can store settings for both the Profiler and Spark in this same file. Spark will only read settings whose names start with the prefix spark.
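To review which settings in a shared file are Spark settings, you can partition the file by key prefix. The sketch below uses a sample file with illustrative values; spark_settings and profiler_settings are hypothetical helper names.

```shell
PROPS_FILE="$(mktemp)"
cat > "$PROPS_FILE" <<'EOF'
# sample shared configuration
spark.master=local
profiler.batch.input.reader=json
profiler.period.duration=15
profiler.period.duration.units=MINUTES
EOF

# Keys beginning with the literal prefix "spark." (the dot is escaped).
spark_settings() {
  grep '^spark\.' "$1"
}

# Every other key=value line; comments and blank lines are skipped.
profiler_settings() {
  grep -v '^spark\.' "$1" | grep -v '^[[:space:]]*#' | grep '='
}

echo "Spark settings:"
spark_settings "$PROPS_FILE"
echo "Other settings:"
profiler_settings "$PROPS_FILE"
```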
Setting | Description |
---|---|
profiler.batch.input.path | The path to the input data read by the Batch Profiler. |
profiler.batch.input.reader | The telemetry reader used to read the input data. |
profiler.batch.input.format | The format of the input data read by the Batch Profiler. |
profiler.batch.input.begin | Only messages with a timestamp equal to or after this will be profiled. |
profiler.batch.input.end | Only messages with a timestamp before or equal to this will be profiled. |
profiler.period.duration | The duration of each profile period. |
profiler.period.duration.units | The units used to specify the profiler.period.duration. |
profiler.hbase.salt.divisor | A salt is prepended to the row key to help prevent hot-spotting. |
profiler.hbase.table | The name of the HBase table that profiles are written to. |
profiler.hbase.column.family | The column family used to store profiles. |
Default: hdfs://localhost:9000/apps/metron/indexing/indexed/*/*
The path to the input data read by the Batch Profiler.
Default: json
Defines how the input data is treated when read. The value is not case sensitive so JSON and json are equivalent.
See Common Formats for further information.
Default: text
The format of the input data read by the Batch Profiler. This is optional and not required in most cases. For example, this property is not required when profiler.batch.input.reader is json, orc, or parquet.
Default: undefined; no time constraint
Only messages with a timestamp equal to or after this will be profiled. The Profiler will only profile messages with a timestamp in [profiler.batch.input.begin, profiler.batch.input.end] inclusive.
By default, no time constraint is defined. The value is expected to follow the ISO-8601 instant format, for example 2011-12-03T10:15:30Z.
Default: undefined; no time constraint
Only messages with a timestamp before or equal to this will be profiled. The Profiler will only profile messages with a timestamp in [profiler.batch.input.begin, profiler.batch.input.end] inclusive.
By default, no time constraint is defined. The value is expected to follow the ISO-8601 instant format, for example 2011-12-03T10:15:30Z.
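A malformed or inverted time window is easy to introduce by hand. The sketch below checks the shape of both values and their order before emitting the two properties. It is deliberately stricter than the format requires: it accepts whole seconds only and does not reject impossible calendar dates. The function names are hypothetical.

```shell
is_instant() {
  # Shape check for values such as 2011-12-03T10:15:30Z (whole seconds, UTC).
  printf '%s\n' "$1" |
    grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$'
}

window_props() {
  # Usage: window_props BEGIN END
  is_instant "$1" && is_instant "$2" || {
    echo "not an ISO-8601 instant" >&2
    return 1
  }
  # Values in this fixed-width UTC form sort byte-wise in time order,
  # so a sort check confirms BEGIN <= END (equal values are allowed).
  printf '%s\n%s\n' "$1" "$2" | LC_ALL=C sort -c 2>/dev/null || {
    echo "begin is after end" >&2
    return 1
  }
  echo "profiler.batch.input.begin=$1"
  echo "profiler.batch.input.end=$2"
}

window_props 2011-12-03T10:15:30Z 2011-12-04T10:15:30Z
```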
Default: 15
The duration of each profile period. This value should be defined along with profiler.period.duration.units.
Important: To read a profile using the Profiler Client, the Profiler Client’s profiler.client.period.duration property must match this value. Otherwise, the Profiler Client will be unable to read the profile data.
Default: MINUTES
The units used to specify the profiler.period.duration. This value should be defined along with profiler.period.duration.
Important: To read a profile using the Profiler Client, the Profiler Client’s profiler.client.period.duration.units property must match this value. Otherwise, the Profiler Client will be unable to read the profile data.
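To see why the writer and the client must agree on the period settings, it helps to picture periods as fixed-width time buckets. The sketch below assumes buckets are aligned to the Unix epoch and found by integer division; that is an assumption for illustration, not a statement of how the Profiler computes periods internally. The period_index helper is hypothetical.

```shell
period_index() {
  # Usage: period_index EPOCH_MILLIS DURATION UNITS
  case "$3" in
    SECONDS) unit_ms=1000 ;;
    MINUTES) unit_ms=60000 ;;
    HOURS)   unit_ms=3600000 ;;
    *) echo "unsupported units: $3" >&2; return 1 ;;
  esac
  # Integer division assigns each timestamp to a fixed-width bucket.
  echo $(( $1 / ($2 * $unit_ms) ))
}

period_index 1530000000000 15 MINUTES   # prints 1700000
period_index 1530000000000 30 MINUTES   # prints 850000
```

Under this assumption the same timestamp lands in a different bucket when the duration changes, so a client configured with 30 MINUTES would look for periods that a 15 MINUTES writer never produced.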
Default: 1000
A salt is prepended to the row key to help prevent hot-spotting. This constant is used to generate the salt. It should be roughly equal to the number of nodes in the HBase cluster to ensure an even distribution of data.
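The general idea of a salt divisor can be sketched as a hash taken modulo the divisor, which bounds the number of distinct salt values. The example below uses cksum purely as a stand-in hash; it does not reproduce the Profiler's actual salt calculation, and salt_for and the sample key are hypothetical.

```shell
salt_for() {
  # Usage: salt_for KEY DIVISOR
  # cksum (a CRC) stands in for whatever hash the Profiler really uses.
  hash="$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
  # The result always falls in the range 0 .. DIVISOR-1.
  echo $(( $hash % $2 ))
}

salt_for "hello-world-1700000" 1000
```

With a divisor of 1, every key receives the same salt and the prefix no longer spreads rows across key ranges, which is why the divisor matters for distribution.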