Converting Dataframe Columns to Vectors in Spark









I am new to Spark and am trying to use some of the MLlib functions for a school project. All of the documentation for doing analytics with MLlib seems to operate on vectors, and I was wondering whether I could run the analysis directly against a DataFrame instead of building vectors in Spark.



For example, the Scala documentation for PCA shows:

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3).fit(df)

The full example is here: https://spark.apache.org/docs/latest/ml-features.html#pca



Is there a way to avoid creating these vectors by hand and instead point the algorithm at the DataFrame I have already created? My DataFrame has 50+ columns and 15,000+ rows, so constructing vectors for each column manually isn't really feasible.
Does anyone have any ideas or suggestions? Lastly, for my project I am limited to Spark in Scala; I am not allowed to use PySpark, Java for Spark, or SparkR.
If anything is unclear, please let me know.
Thanks!










      scala apache-spark apache-spark-ml






edited yesterday by desertnaut

asked yesterday by Lucas Newman




1 Answer






What you are looking for is the VectorAssembler transformer, which takes an array of DataFrame columns and produces a single vector column. You can then use an ML Pipeline that chains the assembler with PCA.



          Help docs are here



          1. vector assembler: https://spark.apache.org/docs/latest/ml-features.html#vectorassembler


          2. ml pipeline: https://spark.apache.org/docs/latest/ml-pipeline.html
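To make this concrete, here is a minimal, untested sketch of that pipeline. It assumes a DataFrame `df` whose feature columns are all numeric, with one column named "label" to exclude; both names are placeholders to adapt to your own schema.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{PCA, VectorAssembler}

// Assemble all columns except a hypothetical "label" column into one vector column.
val assembler = new VectorAssembler()
  .setInputCols(df.columns.filterNot(_ == "label"))  // or an explicit Array("col1", "col2", ...)
  .setOutputCol("features")

// Reduce the assembled features to 5 principal components (pick k for your data).
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(5)

// Chain the two stages so fit/transform runs assembly and PCA in one pass.
val pipeline = new Pipeline().setStages(Array(assembler, pca))
val model = pipeline.fit(df)
val result = model.transform(df).select("pcaFeatures")
```

Since the pipeline fits on the DataFrame directly, you never build vectors by hand; the assembler does it for every row.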


If you need more than PCA offers, you can drop down to the low-level RDD API.
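As a sketch of that lower-level route (again untested, and assuming `df` already has a single vector column named "features", e.g. produced by VectorAssembler), you can compute principal components with mllib's RowMatrix:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Convert the DataFrame's ml.linalg vectors to the older mllib.linalg vectors
// that RowMatrix expects.
val rows = df.select("features").rdd
  .map(row => Vectors.fromML(row.getAs[org.apache.spark.ml.linalg.Vector](0)))

// Compute the top 5 principal components as a local Matrix.
val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(5)
```

The result is a local matrix of loadings you can multiply against the RowMatrix (mat.multiply(pc)) to project the data.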






edited 4 hours ago by sramalingam24










answered yesterday by ookboy24



















