!poetry run python scraper_example.py 'in.indeed.com' -k 'software engineer' -l 'Remote' -r 4 -n 1 --sort_by_date
Saved to DB!
This example shows how to use the job scraper with a database such as DuckDB to persist scraped information for future use and analysis.
Source: hh-13/jobs-scraper@GitHub
import duckdb
con = duckdb.connect("jobs.db")
con
<duckdb.duckdb.DuckDBPyConnection at 0x7bda3f5dfd30>
con.sql("SHOW TABLES;")
┌──────────┐
│ name │
│ varchar │
├──────────┤
│ JOBS │
│ SEARCHES │
└──────────┘
The SEARCHES table stores the search sessions.
con.sql("SELECT * FROM SEARCHES;").pl()
SEARCH_ID | SEARCH_TERM | URL | LOCATION | REMOTE | JOB_TYPE | PAY | COMPANY | JOB_LANGUAGE |
---|---|---|---|---|---|---|---|---|
i32 | str | list[str] | list[str] | list[str] | list[str] | list[str] | list[str] | list[str] |
1 | "data" | ["https://in.indeed.com/jobs?q=data&l=Remote&sort=date&start=0", "https://in.indeed.com/jobs?q=data&l=Remote&sort=date&start=10", "https://in.indeed.com/jobs?q=data&l=Remote&sort=date&start=20"] | ["Remote (596)"] | ["Remote (596)", "Hybrid work (3)"] | ["Full-time (427)", "Contract (76)", … "Fresher (7)"] | ["₹ 37,500.00+/month (501)", "₹ 67,500.00+/month (398)", … "₹ 1,28,333.34+/month (102)"] | ["Nagarro (47)", "MNJ Software (19)", … "Syneos - Clinical and Corporate - Prod (10)"] | ["English (596)"] |
DuckDB allows us to easily break down and parse this information, for example by expanding a list column into one row per element with UNNEST.
con.sql("SELECT UNNEST(JOB_TYPE) AS JOB_TYPES FROM SEARCHES WHERE SEARCH_ID=1;")
┌──────────────────┐
│ JOB_TYPES │
│ varchar │
├──────────────────┤
│ Full-time (427) │
│ Contract (76) │
│ Temporary (75) │
│ Part-time (37) │
│ Internship (11) │
│ Fresher (7) │
└──────────────────┘
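Each unnested value still bundles a label with a count in parentheses, such as Full-time (427). As a minimal sketch, assuming every value ends in a parenthesized integer, the two parts can be separated in plain Python. The helper name `split_facet` is hypothetical and not part of the scraper.

```python
import re

# Matches "<label> (<count>)", where the count is the trailing parenthesized integer.
FACET_RE = re.compile(r"^(?P<label>.*)\s\((?P<count>\d+)\)$")


def split_facet(facet: str) -> tuple[str, int]:
    """Split a string like 'Full-time (427)' into ('Full-time', 427)."""
    match = FACET_RE.match(facet.strip())
    if match is None:
        raise ValueError(f"unexpected facet format: {facet!r}")
    return match.group("label").strip(), int(match.group("count"))


# Values copied from the JOB_TYPES output above.
job_types = ["Full-time (427)", "Contract (76)", "Temporary (75)"]
parsed = [split_facet(item) for item in job_types]
print(parsed)
```

Because the pattern anchors on the last parenthesized number, labels that contain spaces, hyphens, or currency symbols should pass through unchanged.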
The JOBS table contains the job details.
con.sql("SELECT * FROM JOBS LIMIT 5;")
┌────────┬───────────┬──────────────────────┬──────────────────────┬───────────────────────────────────────────────────┐
│ JOB_ID │ SEARCH_ID │ TITLE │ URL │ DESCRIPTION │
│ int32 │ int32 │ varchar │ varchar │ varchar │
├────────┼───────────┼──────────────────────┼──────────────────────┼───────────────────────────────────────────────────┤
│ 1 │ 1 │ Engagement & Data … │ https://in.indeed.… │ Engagement & Data Specialist\n=================… │
│ 2 │ 1 │ Digital Marketing … │ https://in.indeed.… │ Digital Marketing Specialist\n=================… │
│ 3 │ 1 │ Marketing Associate │ https://in.indeed.… │ Marketing Associate\n===================\n\nInd… │
│ 4 │ 1 │ Research Analyst (… │ https://in.indeed.… │ Research Analyst (WFH Relevant candidates only,… │
│ 5 │ 1 │ Lead AI / ML / Dat… │ https://in.indeed.… │ Lead AI / ML / Data Science Engineer - India\n=… │
└────────┴───────────┴──────────────────────┴──────────────────────┴───────────────────────────────────────────────────┘
Links (<a />) are stripped out of job descriptions so they are easier to use with other systems, such as LLMs, for extracting more granular details.
from IPython.display import display, Markdown
# Render the shortest job description as Markdown.
display(
    Markdown(
        con.sql(
            "SELECT len(DESCRIPTION) DESC_LEN, DESCRIPTION FROM JOBS ORDER BY DESC_LEN LIMIT 1;"
        ).fetchone()[1]
    )
)
Here’s how the job details align with your profile.
### Pay
₹96,112.53 - ₹1,15,748.42 a year
### Job type
Full-time
### Shift and schedule
Day shift
Monday to Friday
Location
--------
Job Type: Full-time
Pay: ₹96,112.53 - ₹115,748.42 per year
Schedule:
Experience:
Work Location: Remote
Application Deadline: 27/05/2024
Expected Start Date: 27/05/2024
Report job
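Some of these details already appear as `Key: value` lines, so a few fields can be pulled out without an LLM. The following is a rough sketch using a regular expression, not the LLM-based extraction mentioned earlier. It assumes one field per line, and `extract_fields` is a hypothetical helper name.

```python
import re

# One "Key: value" pair per line; keys are letters and spaces.
# Lines with an empty value (e.g. "Schedule:") are skipped.
FIELD_RE = re.compile(
    r"^(?P<key>[A-Za-z][A-Za-z ]*):[ \t]*(?P<value>\S.*)$", re.MULTILINE
)


def extract_fields(description: str) -> dict[str, str]:
    """Collect 'Key: value' lines from a job description into a dict."""
    return {
        m.group("key").strip(): m.group("value").strip()
        for m in FIELD_RE.finditer(description)
    }


# Lines copied from the description shown above.
sample = """Job Type: Full-time
Pay: ₹96,112.53 - ₹115,748.42 per year
Schedule:
Experience:
Work Location: Remote
Application Deadline: 27/05/2024
"""
fields = extract_fields(sample)
print(fields)
```

A pass like this only catches fields the posting happens to format this way, so free-text details would still need something more flexible.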
# Finally, don't forget to close the connection!
con.close()
What Next?
The possibilities are endless!
Keep Coding!