Converting Dataframe Columns to Vectors in Spark
I am new to Spark and am trying to use some of the MLlib functions to help with a school project. All the documentation for doing analytics with MLlib seems to use vectors, and I was wondering whether I could run what I want directly against a DataFrame instead of a vector.

For example, the Scala documentation for PCA gives:

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

The docs for that are here: https://spark.apache.org/docs/latest/ml-features.html#pca

Is there a way to avoid creating these vectors by hand and instead run PCA on the DataFrame I have already created? That DataFrame has 50+ columns and 15,000+ rows, so building a vector for each column manually isn't really feasible.

Does anyone have any ideas or suggestions? Lastly, for my project I am limited to Spark in Scala; I am not allowed to use PySpark, Java for Spark, or SparkR.

If anything is unclear, please let me know.

Thanks!
scala apache-spark apache-spark-ml
edited yesterday by desertnaut
asked yesterday by Lucas Newman (new contributor)
1 Answer
What you are looking for is the VectorAssembler transformer, which takes an array of DataFrame columns and produces a single vector column. You can then build an ML Pipeline with the assembler followed by PCA.

Help docs are here:
VectorAssembler: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
ML Pipeline: https://spark.apache.org/docs/latest/ml-pipeline.html

If you need more than PCA, you can fall back to low-level RDD transformations.
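To make that concrete, here is a minimal sketch of the assembler-plus-PCA pipeline. The DataFrame, column names, and k=2 are made-up stand-ins for your own 50+ numeric columns; adjust setInputCols and setK to your data.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.SparkSession

object PcaOnDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pca-example").getOrCreate()
    import spark.implicits._

    // Hypothetical DataFrame with plain numeric columns
    // (a stand-in for your existing 50+ column DataFrame).
    val df = Seq((1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0))
      .toDF("c1", "c2", "c3")

    // Collect the numeric columns into a single "features" vector column,
    // so there is no need to construct Vectors by hand.
    val assembler = new VectorAssembler()
      .setInputCols(df.columns) // or a filtered subset of numeric columns
      .setOutputCol("features")

    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(2) // number of principal components to keep

    // The pipeline runs the assembler, then fits PCA on its output.
    val pipeline = new Pipeline().setStages(Array(assembler, pca))
    val result = pipeline.fit(df).transform(df)
    result.select("pcaFeatures").show(truncate = false)

    spark.stop()
  }
}
```

Because both stages are in one Pipeline, the same assembly step is reapplied automatically when you transform new data with the fitted model.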
edited 4 hours ago by sramalingam24
answered yesterday by ookboy24